[jira] [Updated] (YARN-4862) Handle duplicated completed containers in RMNodeImpl
[ https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4862: Description: As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo. Otherwise, in a heavily loaded cluster where event processing gradually slows down, if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), RMNodeImpl keeps creating UpdatedContainerInfo objects for the duplicated containers, with significant impact. This increases heap memory usage and causes problems like YARN-4852. This is an optimization for issues of the kind seen in YARN-4852 was: As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo. In a heavily loaded cluster, if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), the RM should not create UpdatedContainerInfo for the duplicated containers. This is an optimization for issues of the kind seen in YARN-4852 > Handle duplicated completed containers in RMNodeImpl > > > Key: YARN-4862 > URL: https://issues.apache.org/jira/browse/YARN-4862 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > > As per > [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] > from [~sharadag], there should be a safeguard against duplicated container statuses > in RMNodeImpl before creating UpdatedContainerInfo. 
> Otherwise, in a heavily loaded cluster where event processing gradually slows down, > if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), > RMNodeImpl keeps creating UpdatedContainerInfo objects for the duplicated containers, > with significant impact. This increases heap memory usage and causes > problems like YARN-4852. > This is an optimization for issues of the kind seen in YARN-4852 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
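The safeguard proposed above could, in spirit, amount to remembering which completed containers a node has already reported and filtering duplicates out before any UpdatedContainerInfo is created. A minimal, hypothetical sketch of that idea (plain strings stand in for YARN's ContainerId type; none of these names are the actual RMNodeImpl code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for per-node state in RMNodeImpl; not actual YARN code.
public class CompletedContainerDedup {

    // Completed containers already turned into UpdatedContainerInfo.
    private final Set<String> reportedCompleted = new HashSet<>();

    // Returns only the statuses not seen before, so duplicated heartbeat
    // reports no longer queue extra UpdatedContainerInfo objects on the heap.
    public List<String> filterNewlyCompleted(List<String> heartbeatStatuses) {
        List<String> fresh = new ArrayList<>();
        for (String containerId : heartbeatStatuses) {
            if (reportedCompleted.add(containerId)) { // add() returns false for duplicates
                fresh.add(containerId);
            }
        }
        return fresh;
    }
}
```

With a guard like this, a heartbeat that re-sends an already-reported container yields nothing new; a real fix would also need to prune the set once the NM is told to forget the container.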
[jira] [Commented] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking
[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209809#comment-15209809 ] Hadoop QA commented on YARN-4676: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 18s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 5s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 10s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 7s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 14s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 17s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 7s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s {color} | {color:green} root: patch generated 0 new + 498 unchanged - 4 fixed = 498 total (was 502) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 9s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 2m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 0s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 10s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 10m 2s {color} | {color:red} hadoop-common in the patch failed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 32s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 18s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 3s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 27s {color} | {color:red} hadoop
[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters
[ https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209783#comment-15209783 ] Hudson commented on YARN-4820: -- FAILURE: Integrated in Hadoop-trunk-Commit #9494 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9494/]) YARN-4820. ResourceManager web redirects in HA mode drops query (junping_du: rev 19b645c93801a53d4486f9a7639186525e51f723) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java > ResourceManager web redirects in HA mode drops query parameters > --- > > Key: YARN-4820 > URL: https://issues.apache.org/jira/browse/YARN-4820 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: YARN-4820.001.patch, YARN-4820.002.patch, > YARN-4820.003.patch > > > The RMWebAppFilter redirects http requests from the standby to the active. > However it drops all the query parameters when it does the redirect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
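The bug described above comes down to carrying the request's query string over to the redirect URL when the standby forwards to the active RM. A hypothetical sketch of the idea (invented names, not the actual RMWebAppFilter code):

```java
// Sketch of an HA redirect that keeps query parameters instead of dropping them.
public class RedirectSketch {

    // Builds the redirect target on the active RM from the path and query
    // string of the request that hit the standby RM.
    public static String redirectTarget(String activeRmBase, String path, String query) {
        String target = activeRmBase + path;
        if (query != null && !query.isEmpty()) {
            target += "?" + query; // preserve the query string on redirect
        }
        return target;
    }
}
```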
[jira] [Commented] (YARN-4822) Refactor existing Preemption Policy of CS for easier adding new approach to select preemption candidates
[ https://issues.apache.org/jira/browse/YARN-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209733#comment-15209733 ] Hadoop QA commented on YARN-4822: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 65 new + 47 unchanged - 52 fixed = 112 total (was 99) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 20s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 46s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 170m 39s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209726#comment-15209726 ] Rohith Sharma K S commented on YARN-4852: - Raised JIRA YARN-4862 to handle the duplicated container status check. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > itself down. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Back-to-back full GCs kept happening; GC wasn't able to recover any heap, > and the RM went OOM. The JVM dumped the heap before quitting, and we analyzed it. > The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 > minutes and went OOM. > There was no spike in job submissions or container numbers at the time the > issue occurred.
[jira] [Created] (YARN-4862) Handle duplicated completed containers in RMNodeImpl
Rohith Sharma K S created YARN-4862: --- Summary: Handle duplicated completed containers in RMNodeImpl Key: YARN-4862 URL: https://issues.apache.org/jira/browse/YARN-4862 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo. In a heavily loaded cluster, if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), the RM should not create UpdatedContainerInfo for the duplicated containers. This is an optimization for issues of the kind seen in YARN-4852
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209715#comment-15209715 ] Rohith Sharma K S commented on YARN-4852: - I will raise a new ticket for this. Thanks :-)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209689#comment-15209689 ] Sharad Agarwal commented on YARN-4852: -- Thanks Rohith. Should we consider adding a duplicate check on the RM side for completed containers as well, as we do for launched ones? This would make it more foolproof and eliminate scenarios like resync where the NM might still send duplicates. We can open a new ticket for this.
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209609#comment-15209609 ] Rohith Sharma K S commented on YARN-4852: - Adding to the above point: since NM->RM communication is a push design, already-sent containers are not supposed to be sent again unless there is a RESYNC command from the RM. So it should be a bug in the NodeManager.
[jira] [Commented] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209606#comment-15209606 ] Hadoop QA commented on YARN-4436: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 11s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: patch generated 1 new + 49 unchanged - 2 fixed = 50 total (was 51) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 9s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 9s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 25s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 26m 59s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795084/YARN-4436.002.patch | | JIRA Issue | YARN-4436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 90e15ca288ff 3
[jira] [Updated] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantinos Karanasos updated YARN-2883: - Attachment: YARN-2883-trunk.005.patch Adding updated patch, after addressing [~chris.douglas]'s comments. Also addressed [~kasha]'s first comments. I added a new JIRA (YARN-4861), so that we address the comment related to the ExitStatus of a killed OPPORTUNISTIC container. Moreover, I did not address the comment about bounding the queue size, as this should be done in a new JIRA too. > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-trunk.004.patch, YARN-2883-trunk.005.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
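The queuing behavior described in the quoted issue can be sketched in a few lines. Assuming a single scalar resource and invented names (this is an illustration of the admission decision, not the actual YARN-2883 patch):

```java
import java.util.ArrayDeque;

// Illustrative sketch of an NM-side container queue: start queued containers
// only while the node has spare capacity for them.
public class ContainerQueueSketch {

    private int freeVcores;
    // Resource demand (in vcores) of each queued container, in arrival order.
    private final ArrayDeque<Integer> queue = new ArrayDeque<>();

    public ContainerQueueSketch(int freeVcores) { this.freeVcores = freeVcores; }

    public void enqueue(int vcores) { queue.add(vcores); }

    // Starts queued containers from the head while they fit; returns how many
    // were started. A GUARANTEED container would instead be started at once,
    // killing queueable containers if needed to reclaim capacity.
    public int startQueuedIfPossible() {
        int started = 0;
        while (!queue.isEmpty() && queue.peek() <= freeVcores) {
            freeVcores -= queue.poll();
            started++;
        }
        return started;
    }

    public int freeVcores() { return freeVcores; }
}
```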
[jira] [Commented] (YARN-4861) Define ContainerExitStatus for OPPORTUNISTIC containers that get killed
[ https://issues.apache.org/jira/browse/YARN-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209596#comment-15209596 ] Konstantinos Karanasos commented on YARN-4861: -- An OPPORTUNISTIC container might be killed in one of the following cases: * by the AM while running; * by the AM while queued; * by the NM, while running, in order to free up resources for a GUARANTEED container to start its execution; * by the NM, while queued, in order to reduce the length of the queue. In all these cases, we need to define the proper Exit Status for the container. Then, we need to make sure that the AM reacts properly to the defined Exit Statuses (e.g., by rescheduling killed OPPORTUNISTIC containers). Currently, in YARN-2883, OPPORTUNISTIC containers that get killed by the NM while running get a KILLED_BY_APPMASTER ExitStatus. In YARN-4738, OPPORTUNISTIC containers that get killed while queued get an ABORTED ExitStatus. > Define ContainerExitStatus for OPPORTUNISTIC containers that get killed > --- > > Key: YARN-4861 > URL: https://issues.apache.org/jira/browse/YARN-4861 > Project: Hadoop YARN > Issue Type: Task >Reporter: Konstantinos Karanasos > > When we kill an OPPORTUNISTIC container, which is either running or queued, > we need to define its Exit Status.
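The four kill cases above can be pictured as a mapping from (who killed, container state) to an exit status. A hypothetical sketch using the current behavior the comment reports for YARN-2883 and YARN-4738 (the dedicated statuses this JIRA proposes do not exist yet; the AM-kill branch and all names here are assumptions, not actual YARN code):

```java
// Illustrative mapping of kill cases to exit statuses for OPPORTUNISTIC containers.
public class OpportunisticExitStatus {

    enum Killer { AM, NM }
    enum State { RUNNING, QUEUED }

    static String exitStatusFor(Killer killer, State state) {
        if (killer == Killer.AM) {
            return "KILLED_BY_APPMASTER";            // assumed: AM-initiated kills
        }
        return state == State.RUNNING
            ? "KILLED_BY_APPMASTER"                  // current YARN-2883 behavior
            : "ABORTED";                             // current YARN-4738 behavior
    }
}
```

Defining distinct statuses for the NM-initiated cases would let the AM distinguish "make room for GUARANTEED" kills (worth rescheduling) from genuine AM kills.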
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209583#comment-15209583 ] Rohith Sharma K S commented on YARN-4852: - Thanks for pointing out the duplicated container statuses stored in UpdatedContainerInfo. This brings to mind YARN-2997, which is already resolved. The scenario: the NM keeps completed containers in NMContext until the RM, in its response, notifies the NM to remove them. Every heartbeat, these container statuses (pendingCompletedContainers) are sent to the RM, which can result in duplicates! But on the RM side, while creating UpdatedContainerInfo, no validation is done for duplicated entries. These keep accumulating when scheduler event processing is slow.
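The NM-side behavior described in this comment, sketched with invented names (not the actual NodeStatusUpdater code): completed containers stay in a pending set and are re-sent on every heartbeat until the RM's response acknowledges them, which is why the RM can see the same completed container more than once.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of pendingCompletedContainers-style NM state.
public class PendingCompletedContainers {

    private final Set<String> pending = new HashSet<>();

    public void containerCompleted(String containerId) { pending.add(containerId); }

    // Included in every heartbeat; the same ids repeat until acknowledged.
    public Set<String> heartbeatPayload() { return new HashSet<>(pending); }

    // The RM's heartbeat response tells the NM which containers it may forget.
    public void ackFromRM(Set<String> acked) { pending.removeAll(acked); }
}
```

Under this model, the dedup check really does belong on the RM side: re-sending until ack is by design, so the RM must tolerate repeats.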
[jira] [Created] (YARN-4861) Define ContainerExitStatus for OPPORTUNISTIC containers that get killed
Konstantinos Karanasos created YARN-4861: Summary: Define ContainerExitStatus for OPPORTUNISTIC containers that get killed Key: YARN-4861 URL: https://issues.apache.org/jira/browse/YARN-4861 Project: Hadoop YARN Issue Type: Task Reporter: Konstantinos Karanasos When we kill an OPPORTUNISTIC container, which is either running or queued, we need to define its Exit Status.
[jira] [Commented] (YARN-4826) Document configuration of ReservationSystem for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209571#comment-15209571 ] Hadoop QA commented on YARN-4826: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 7m 38s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795078/YARN-4826.v1.patch | | JIRA Issue | YARN-4826 | | Optional Tests | asflicense mvnsite | | uname | Linux a5c487fd9567 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 938222b | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10861/console | | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org | This message was automatically generated. > Document configuration of ReservationSystem for CapacityScheduler > - > > Key: YARN-4826 > URL: https://issues.apache.org/jira/browse/YARN-4826 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Minor > Attachments: YARN-4826.v1.patch > > > This JIRA tracks the effort to add documentation on how to configure > ReservationSystem for CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4860) Created node labels disappear after restarting the Resource Manager
Yi Zhou created YARN-4860: - Summary: Created node labels disappear after restarting the Resource Manager Key: YARN-4860 URL: https://issues.apache.org/jira/browse/YARN-4860 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Yi Zhou In 2.6, restarting the RM causes created node labels to disappear, and the RM fails to start up:
{code}
Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: NodeLabelManager doesn't include label = y, please check.
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1000)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:262)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
Caused by: java.io.IOException: NodeLabelManager doesn't include label = y, please check.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.<init>(AbstractCSQueue.java:106)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.<init>(LeafQueue.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:569)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:589)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:464)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:296)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:326)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209535#comment-15209535 ] Sharad Agarwal commented on YARN-4852: -- Further analysis shows an exceptionally high volume of "Null container completed..." log lines, somewhere between 100k and 200k every minute. This could be related to a large number of duplicate UpdatedContainerInfo objects for completed containers. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory. When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209520#comment-15209520 ] Sharad Agarwal commented on YARN-4852: -- [~rohithsharma] the slowness in the schedulers still does not explain the build-up of 0.5 million UpdatedContainerInfo objects in such a short span. UpdatedContainerInfo should only be created for newly launched/completed containers. Looking at the code in RMNodeImpl.StatusUpdateWhenHealthyTransition (branch 2.6.0):
{code}
  // Process running containers
  if (remoteContainer.getState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      newlyLaunchedContainers.add(remoteContainer);
    }
  } else {
    // A finished container
    rmNode.launchedContainers.remove(containerId);
    completedContainers.add(remoteContainer);
  }
}

if (newlyLaunchedContainers.size() != 0 || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(newlyLaunchedContainers, completedContainers));
}
{code}
A new UpdatedContainerInfo appears to be created each time there is a completed container in the container status (there is no check whether it was already created from a previous update). Wouldn't this lead to a lot of duplicate UpdatedContainerInfo objects, putting unnecessary stress on the scheduler? > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory.
When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
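The effect Sharad describes — each heartbeat that repeats a completed-container status enqueuing one more UpdatedContainerInfo — can be modeled in a few lines. This is a toy model for illustration, not RMNodeImpl itself:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy model of the nodeUpdateQueue behavior quoted above: every heartbeat
// carrying a completed-container status enqueues a fresh entry, with no
// check against statuses already enqueued on earlier heartbeats.
public class NodeUpdateQueueModel {
    private final Queue<List<String>> nodeUpdateQueue = new ArrayDeque<>();

    public void onHeartbeat(List<String> completedContainerStatuses) {
        if (!completedContainerStatuses.isEmpty()) {
            nodeUpdateQueue.add(completedContainerStatuses); // no duplicate check
        }
    }

    public int queueSize() {
        return nodeUpdateQueue.size();
    }
}
```

Feeding the same completed container id on N consecutive heartbeats yields N queue entries, which is how the queue can reach hundreds of thousands of objects when scheduler event processing lags behind heartbeats.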
[jira] [Commented] (YARN-4822) Refactor existing Preemption Policy of CS for easier adding new approach to select preemption candidates
[ https://issues.apache.org/jira/browse/YARN-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209469#comment-15209469 ] Wangda Tan commented on YARN-4822: -- [~eepayne], [~sunilg], [~jianhe], I'd appreciate it if you could take a look at the latest patch; it contains a couple of refactorings: - PCPP is split into 2 parts: 1) basic code, such as cloning queues, recording what to preempt, and sending the kill event when max-wait is reached; 2) the candidates-selection policy, which includes calculating the ideal allocation and selecting preemption candidates - The original ideal-allocation calculation and preemption-candidate selection go into two classes: 1) FifoPreemptableAmountCalculator for the ideal-allocation calculation; 2) FifoCandidatesSelectionPolicy for how to select containers - The CandidatesSelectionPolicy and the calculator need to read some fields from PCPP, so I added an interface for them to use, implemented by PCPP: CapacitySchedulerPreemptionContext - Moved all configuration keys from PCPP to CapacitySchedulerConfiguration, so admins can set configurations in either yarn-site.xml or capacity-scheduler.xml. (Ideally they should be set in capacity-scheduler.xml; however, existing users set configs in yarn-site.xml. Since CapacitySchedulerConfiguration reads yarn-site.xml as well, it is a backward-compatible change.) Thanks, > Refactor existing Preemption Policy of CS for easier adding new approach to > select preemption candidates > > > Key: YARN-4822 > URL: https://issues.apache.org/jira/browse/YARN-4822 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4822.1.patch, YARN-4822.2.patch, YARN-4822.3.patch, > YARN-4822.4.patch > > > Currently, ProportionalCapacityPreemptionPolicy has hard coded logic to > select candidates to be preempted (based on FIFO order of > applications/containers).
It's not simple to add new candidate-selection > logics, such as preemption for large container, intra-queue fairness/policy, > etc. > In this JIRA, I propose to do following changes: > 1) Cleanup code bases, consolidate current logic into 3 stages: > - Compute ideal sharing of queues > - Select to-be-preempt candidates > - Send preemption/kill events to scheduler > 2) Add a new interface: {{PreemptionCandidatesSelectionPolicy}} for above > "select to-be-preempt candidates" part. Move existing how to select > candidates logics to {{FifoPreemptionCandidatesSelectionPolicy}}. > 3) Allow multiple PreemptionCandidatesSelectionPolicies work together in a > chain. Preceding PreemptionCandidatesSelectionPolicy has higher priority to > select candidates, and later PreemptionCandidatesSelectionPolicy can make > decisions according to already selected candidates and pre-computed queue > ideal shares of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4285) Display resource usage as percentage of queue and cluster in the RM UI
[ https://issues.apache.org/jira/browse/YARN-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209449#comment-15209449 ] Wangda Tan commented on YARN-4285: -- [~jianhe], bq. However, the queue's used resource in the UI does include reserved resource too. IIUC, the queue's used resource should include reserved resources. [~vvasudev], bq. it makes sense to remove reserved resources from the used resources, Actually I think we should include reserved resources in used resources, unless we can show them together on the UI. See my [#2 comment|https://issues.apache.org/jira/browse/YARN-4678?focusedCommentId=15209365&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209365]. bq. ...but do we know why we counted reserved resources as part of used resources in the first place? The reason is, if we create a reserved container under a queue, we need to make sure it doesn't go beyond the queue's max capacity. In other words, if a resource is reserved by someone, nobody else can use that part of the resources. From YARN's perspective, a queue with 99G allocated (not reserved) + 1G reserved is the same as one with 1G allocated + 99G reserved. To be more transparent to users and avoid answering questions like "why is my total allocated resource always less than total resources?", used resource should be allocated + reserved. > Display resource usage as percentage of queue and cluster in the RM UI > -- > > Key: YARN-4285 > URL: https://issues.apache.org/jira/browse/YARN-4285 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: YARN-4285.001.patch, YARN-4285.002.patch, > YARN-4285.003.patch, YARN-4285.004.patch > > > Currently, we display the memory and vcores allocated to an app in the RM UI.
> It would be useful to display the resources consumed as a % of the queue and > the cluster to identify apps that are using a lot of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
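Wangda's convention above — report a queue's used resource as allocated plus reserved — reduces to a simple sum. The helper below is a hypothetical sketch for illustration, not actual scheduler code:

```java
// Hypothetical helper illustrating the convention argued for above:
// used = allocated + reserved, so 99G allocated + 1G reserved and
// 1G allocated + 99G reserved both display the same used amount.
public class QueueUsedResource {
    public static long usedMb(long allocatedMb, long reservedMb) {
        return allocatedMb + reservedMb;
    }
}
```

Under this convention the two extreme cases Wangda mentions are indistinguishable in the UI, which is exactly the point: reserved capacity is unusable by anyone else, so it counts as used.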
[jira] [Created] (YARN-4859) [Bug] Unable to submit a job to a reservation when using FairScheduler
Subru Krishnan created YARN-4859: Summary: [Bug] Unable to submit a job to a reservation when using FairScheduler Key: YARN-4859 URL: https://issues.apache.org/jira/browse/YARN-4859 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Subru Krishnan Assignee: Arun Suresh Jobs submitted to a reservation get stuck at scheduled stage when using FairScheduler. I came across this when working on YARN-4827 (documentation for configuring ReservationSystem for FairScheduler) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4751) In 2.7, Labeled queue usage not shown properly in capacity scheduler UI
[ https://issues.apache.org/jira/browse/YARN-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209373#comment-15209373 ] Wangda Tan commented on YARN-4751: -- Hi [~eepayne], [~sunilg]. I quickly read the discussions and looked at the patch. Several questions / comments: 1) The ultimate solution seems to be YARN-3362. Have you evaluated how hard it would be to backport it? 2) If you don't want to backport YARN-3362: IIUC, the computation of total-used-capacity-considering-all-labels seems wrong. In your patch it is Σ(queue.label.used_capacity); actually it should be Σ(queue.label.used_resource) / Σ(root.label.total_resource). Thoughts? > In 2.7, Labeled queue usage not shown properly in capacity scheduler UI > --- > > Key: YARN-4751 > URL: https://issues.apache.org/jira/browse/YARN-4751 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.3 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: 2.7 CS UI No BarGraph.jpg, > YARH-4752-branch-2.7.001.patch, YARH-4752-branch-2.7.002.patch > > > In 2.6 and 2.7, the capacity scheduler UI does not have the queue graphs > separated by partition. When applications are running on a labeled queue, no > color is shown in the bar graph, and several of the "Used" metrics are zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
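The corrected computation Wangda suggests — summing used resources and total resources across all labels before dividing, rather than summing per-label used-capacity ratios — could be sketched like this (a hypothetical helper with illustrative label-to-memory inputs, not patch code):

```java
import java.util.Map;

// Sketch of Σ(queue.label.used_resource) / Σ(root.label.total_resource),
// as opposed to summing per-label used_capacity ratios. The maps are
// hypothetical label -> memory (MB) inputs.
public class LabelCapacity {
    public static double totalUsedCapacity(Map<String, Long> usedByLabel,
                                           Map<String, Long> totalByLabel) {
        long used = usedByLabel.values().stream().mapToLong(Long::longValue).sum();
        long total = totalByLabel.values().stream().mapToLong(Long::longValue).sum();
        return total == 0 ? 0.0 : (double) used / total;
    }
}
```

Summing the ratios instead would weight a small partition the same as a large one, which is why the per-label fractions cannot simply be added.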
[jira] [Commented] (YARN-4678) Cluster used capacity is > 100 when container reserved
[ https://issues.apache.org/jira/browse/YARN-4678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209365#comment-15209365 ] Wangda Tan commented on YARN-4678: -- Actually I think we may need to treat this as 3 separate tasks: 1) Understand why reserved resource + allocated resource could exceed the queue's max capacity; maybe we can add a test to make sure it won't happen. 2) If we simply deduct reserved resources from used and show that on the UI, users could find cluster utilization is < 100% most of the time, and it's going to be hard to explain why it cannot reach 100%. The ideal solution is to show reserved and allocated resources on the same bar in different colors. 3) Record reserved resources in ResourceUsage and QueueCapacities separately. Thoughts? > Cluster used capacity is > 100 when container reserved > --- > > Key: YARN-4678 > URL: https://issues.apache.org/jira/browse/YARN-4678 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Sunil G > Attachments: 0001-YARN-4678.patch, 0002-YARN-4678.patch, > 0003-YARN-4678.patch > > > *Scenario:* > * Start cluster with Three NM's each having 8GB (cluster memory:24GB). > * Configure queues with elasticity and userlimitfactor=10. > * disable pre-emption. > * run two jobs with different priorities in different queues at the same time > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=LOW > -Dmapreduce.job.queuename=QueueA -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 > -Dmapreduce.job.reduce.slowstart.completedmaps=1.0 10 1 > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=HIGH > -Dmapreduce.job.queuename=QueueB -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 3 1 > * observe the cluster capacity which was used in RM web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt LaMantia updated YARN-4436: Attachment: YARN-4436.002.patch > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > Attachments: YARN-4436.001.patch, YARN-4436.002.patch > > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4678) Cluster used capacity is > 100 when container reserved
[ https://issues.apache.org/jira/browse/YARN-4678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209298#comment-15209298 ] Wangda Tan commented on YARN-4678: -- Hi [~sunilg], Thanks for working on this JIRA; it is useful to record reserved resources separately. However, I'm wondering how this could happen: the ParentQueue's capacity is checked when we reserve a container, and we should make sure that the allocation of a reserved container doesn't violate the parent queue's max capacity. > Cluster used capacity is > 100 when container reserved > --- > > Key: YARN-4678 > URL: https://issues.apache.org/jira/browse/YARN-4678 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Sunil G > Attachments: 0001-YARN-4678.patch, 0002-YARN-4678.patch, > 0003-YARN-4678.patch > > > *Scenario:* > * Start cluster with Three NM's each having 8GB (cluster memory:24GB). > * Configure queues with elasticity and userlimitfactor=10. > * disable pre-emption. > * run two jobs with different priorities in different queues at the same time > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=LOW > -Dmapreduce.job.queuename=QueueA -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 > -Dmapreduce.job.reduce.slowstart.completedmaps=1.0 10 1 > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=HIGH > -Dmapreduce.job.queuename=QueueB -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 3 1 > * observe the cluster capacity which was used in RM web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
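The invariant Wangda mentions — a reserved container must not push the queue past its max capacity — can be sketched as a simple admission check. This is a hypothetical guard for illustration, not the actual CapacityScheduler logic:

```java
// Hypothetical guard for the invariant discussed above: only allow a new
// reservation if allocated + already-reserved + requested stays within the
// queue's max capacity.
public class ReservationCheck {
    public static boolean canReserve(long allocatedMb, long reservedMb,
                                     long requestMb, long queueMaxMb) {
        return allocatedMb + reservedMb + requestMb <= queueMaxMb;
    }
}
```

A test along these lines could verify that reservations never let allocated + reserved exceed the max, which is task 1) from Wangda's earlier comment.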
[jira] [Updated] (YARN-4822) Refactor existing Preemption Policy of CS for easier adding new approach to select preemption candidates
[ https://issues.apache.org/jira/browse/YARN-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4822: - Attachment: YARN-4822.4.patch Attached ver.4 patch; fixed unit test failures and javac warnings. > Refactor existing Preemption Policy of CS for easier adding new approach to > select preemption candidates > > > Key: YARN-4822 > URL: https://issues.apache.org/jira/browse/YARN-4822 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4822.1.patch, YARN-4822.2.patch, YARN-4822.3.patch, > YARN-4822.4.patch > > > Currently, ProportionalCapacityPreemptionPolicy has hard coded logic to > select candidates to be preempted (based on FIFO order of > applications/containers). It's not simple to add new candidate-selection > logics, such as preemption for large container, intra-queue fairness/policy, > etc. > In this JIRA, I propose to do following changes: > 1) Cleanup code bases, consolidate current logic into 3 stages: > - Compute ideal sharing of queues > - Select to-be-preempt candidates > - Send preemption/kill events to scheduler > 2) Add a new interface: {{PreemptionCandidatesSelectionPolicy}} for above > "select to-be-preempt candidates" part. Move existing how to select > candidates logics to {{FifoPreemptionCandidatesSelectionPolicy}}. > 3) Allow multiple PreemptionCandidatesSelectionPolicies work together in a > chain. Preceding PreemptionCandidatesSelectionPolicy has higher priority to > select candidates, and later PreemptionCandidatesSelectionPolicy can make > decisions according to already selected candidates and pre-computed queue > ideal shares of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4826) Document configuration of ReservationSystem for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-4826: - Attachment: YARN-4826.v1.patch > Document configuration of ReservationSystem for CapacityScheduler > - > > Key: YARN-4826 > URL: https://issues.apache.org/jira/browse/YARN-4826 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Minor > Attachments: YARN-4826.v1.patch > > > This JIRA tracks the effort to add documentation on how to configure > ReservationSystem for CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209288#comment-15209288 ] Jonathan Maron commented on YARN-4757: -- {quote} Are there situations where you would just return the IP Address of the node the container is running on? {quote} One situation I can readily think of is YARN Linux containers, etc., that are not assigned an IP. The appropriate way to manage those should be considered (I can add this to Open Issues in the next revision). {quote} Does that mean that we will return records for any service API no matter how the IP Addresses are assigned, or there is no way for the IP Address to not be available? {quote} Application records are generally associated with the AM and the host on which it resides (at least that's true of the Slider use cases, which are the only ones currently making use of the YARN registry and service records). So most of the CNAME/TXT records mapping to an API will leverage that host IP. {quote} How is authentication with zookeeper handled? Is it always SASL+kerberos? {quote} Probably best to just point you to this writeup: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/registry/registry-security.html {quote} Would we be exposing SRV records for both of these combinations? If so how would they be named? {quote} Yes. The current design calls for the creation and registration of SRV records that carry both the application name and the API names. {quote} I am not an expert on DNS so if I say something silly after you stop laughing please let me know {quote} I have been working with dnsjava and BIND trying to learn the internals for the last few months, so I'm by no means an expert. And I'm not going to laugh - if anything I'm going to thank you profusely for the help! {quote} What about limits on the number of IP addresses that can be returned for a given name.
I could not find anything specific but I have to assume that in practice most systems don't support a huge number of these, and large clusters on YARN can easily launch hundreds or even thousands of containers for a given service. {quote} I'd have to look into the relevant RFCs and other literature to see if there is a length limit. Generally documentation points to the host name RFC (1123?). I think limits on the length of a name would also be dictated by other software products (DBs, etc.), so we'd have to consider any "shortening" that may be required. You can have multiple addresses mapped to a single name, e.g.
{code}
HW10386:hadoop jmaron$ nslookup www.google.com
Server:     192.168.1.1
Address:    192.168.1.1#53

Non-authoritative answer:
Name: www.google.com
Address: 63.117.14.150
Name: www.google.com
Address: 63.117.14.151
Name: www.google.com
Address: 63.117.14.155
Name: www.google.com
Address: 63.117.14.154
Name: www.google.com
Address: 63.117.14.153
Name: www.google.com
Address: 63.117.14.148
Name: www.google.com
Address: 63.117.14.149
Name: www.google.com
Address: 63.117.14.152
{code}
So, some of the naming conventions (e.g. component name) may point to multiple container IPs. Addressing that through component name uniqueness (there is a Slider JIRA for that) may be one possibility. {quote} In addition to Allen's concerns the document does not seem to address/call out my initial concerns about requiring mutual authentication, or handling of port availability in scheduling. {quote} I'm going to need a little more help in understanding these concerns. The approach we provide is targeted at supporting standard DNS clients, and DNS does not provide for mutual authentication - the concept of restricting who can query the DNS for records is considered outside the scope of the protocol.
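The multi-address behavior shown in the nslookup output can also be observed programmatically with the JDK resolver. A minimal sketch — "localhost" is used here only so the example works without external DNS; any multi-homed name would do:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolves every address bound to a name, mirroring the multi-record
// nslookup output above. Returns an empty array on lookup failure.
public class MultiAddressLookup {
    public static String[] resolveAll(String host) {
        try {
            InetAddress[] addrs = InetAddress.getAllByName(host);
            String[] out = new String[addrs.length];
            for (int i = 0; i < addrs.length; i++) {
                out[i] = addrs[i].getHostAddress();
            }
            return out;
        } catch (UnknownHostException e) {
            return new String[0];
        }
    }
}
```

A client that round-robins over the returned array gets crude load distribution for free, which is one reason multiple A records per name are common for services.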
As for port availability - currently the DNS implementation is targeted at relaying the port assignments as designated by YARN scheduler, rather than actively participating in the scheduling itself. So I assume I'm misunderstanding > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of serviceÂ-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice,
[jira] [Commented] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209209#comment-15209209 ] Hadoop QA commented on YARN-4436: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 11s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: patch generated 1 new + 50 unchanged - 1 fixed = 51 total (was 51) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 16s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 29s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 27m 42s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795048/YARN-4436.001.patch | | JIRA Issue | YARN-4436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 04b6529d8b2a
[jira] [Updated] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking
[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Zhi updated YARN-4676: - Attachment: YARN-4676.008.patch rebased to latest trunk code, merged and resolved conflict with the recently-added DECOMMISSIONING node resource update logic. > Automatic and Asynchronous Decommissioning Nodes Status Tracking > > > Key: YARN-4676 > URL: https://issues.apache.org/jira/browse/YARN-4676 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Zhi >Assignee: Daniel Zhi > Labels: features > Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, > YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, > YARN-4676.008.patch > > > DecommissioningNodeWatcher inside ResourceTrackingService tracks > DECOMMISSIONING nodes' status automatically and asynchronously after the > client/admin makes the graceful decommission request. It tracks > DECOMMISSIONING nodes' status to decide when, after all running containers on > the node have completed, the node will be transitioned into the DECOMMISSIONED > state. NodesListManager detects and handles include and exclude list changes > to kick off decommission or recommission as necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209067#comment-15209067 ] Robert Joseph Evans commented on YARN-4757: --- I also did a quick pass through the document and I wanted to clarify a few things. In some places in the document, like with names that map to containers and names that map to components, it says something like "If Available", indicating that if an IP address is not assigned to the individual container, no mapping will be made. Am I interpreting that correctly? Are there situations where you would just return the IP address of the node the container is running on? Or am I mistaken in my interpretation, and there are different situations where we could launch a container that would have no IP address available? However, for the per-application records there is no such conditional. Does that mean that we will return records for any service API no matter how the IP addresses are assigned, or is there no way for the IP address to not be available? Also, I am not super familiar with the Slider registry, so perhaps you could clarify a few things there too. How is authentication with ZooKeeper handled? Is it always SASL+Kerberos? I ask because the doc mentions that the RM has to set up the base user directory with permissions. Would any secure Slider app that wants to use the registry then be required to ship a keytab with its application? Also, I am not super familiar with the existing registry API; the example in the doc shows a few different types of services that an Application Master can register, both Host/Port and URI. Would we be exposing SRV records for both of these combinations? If so, how would they be named? I am also curious about limits to various DNS fields, both in the protocol and in practice with common implementations. I am not an expert on DNS, so if I say something silly, after you stop laughing please let me know.
The document talks a lot about doing character remapping and having to have unique application names, but it does not talk about limits to the lengths of those names (I have seen some DNS servers that don't support names longer than 254 characters). What about limits on the number of IP addresses that can be returned for a given name? I could not find anything specific, but I have to assume that in practice most systems don't support a huge number of these, and large clusters on YARN can easily launch hundreds or even thousands of containers for a given service. In addition to Allen's concerns, the document does not seem to address/call out my initial concerns about requiring mutual authentication, or handling of port availability in scheduling. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry-specific (java) API and a REST interface. In practice, this makes it > very difficult to wire up existing clients and services. For e.g., dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services.
> YARN-913 in fact talked about such a DNS-based mechanism but left it as a > future task. Having the registry information exposed via DNS > simplifies the life of services.
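As a side note on the name-length question above: RFC 1035 caps a full domain name at 255 octets (253 characters in the usual dotted text form) and each dot-separated label at 63 octets. A minimal validator illustrating those limits; the class and method names here are invented for illustration and are not part of the proposal:

```java
// Hypothetical helper showing the RFC 1035 limits discussed above:
// each dot-separated label is at most 63 octets, and the whole name
// at most 253 characters in textual form.
public class DnsNameValidator {
    public static boolean isValidDnsName(String name) {
        if (name.isEmpty() || name.length() > 253) {
            return false;
        }
        for (String label : name.split("\\.", -1)) {
            if (label.isEmpty() || label.length() > 63) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidDnsName("regionserver.hbase.user.example.com")); // true
        // A single 64-character label already exceeds the per-label limit.
        System.out.println(isValidDnsName("a".repeat(64) + ".example.com"));        // false
    }
}
```

Any generated application/container names would have to fit within these bounds after the character remapping the document describes.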
[jira] [Commented] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209022#comment-15209022 ] Daniel Templeton commented on YARN-4436: LGTM. +1 (non-binding). [~rkanter], wanna do the honors after Jenkins reports back? > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > Attachments: YARN-4436.001.patch > > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt LaMantia updated YARN-4436: Attachment: YARN-4436.001.patch > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > Attachments: YARN-4436.001.patch > > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209001#comment-15209001 ] Varun Saxena commented on YARN-3863: [~sjlee0], kindly review. I had replaced the patch after rebasing it. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, > YARN-3863-YARN-2928.v2.04.patch, YARN-3863-YARN-2928.v2.05.patch, > YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
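For readers following the AND vs. OR discussion above: the semantics can be sketched with plain java.util.function predicates, where HBase's FilterList with MUST_PASS_ALL / MUST_PASS_ONE plays the analogous role. This is an illustrative stand-in, not the actual TimelineReader or TimelineEntity API:

```java
import java.util.function.Predicate;

public class FilterDemo {
    // Stand-in for a timeline entity; not the real TimelineEntity class.
    static class Entity {
        final String user;
        final long createdTime;
        Entity(String user, long createdTime) {
            this.user = user;
            this.createdTime = createdTime;
        }
    }

    static final Predicate<Entity> BY_USER = e -> e.user.equals("varun");
    static final Predicate<Entity> RECENT  = e -> e.createdTime > 1000L;

    public static void main(String[] args) {
        Entity e = new Entity("varun", 500L);
        // Current semantics: all conditions must hold (like MUST_PASS_ALL).
        System.out.println(BY_USER.and(RECENT).test(e)); // false
        // Proposed OR semantics: any condition may hold (like MUST_PASS_ONE).
        System.out.println(BY_USER.or(RECENT).test(e));  // true
    }
}
```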
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208950#comment-15208950 ] Jonathan Maron commented on YARN-4757: -- a) A records are more usable from an existing client interaction perspective. For example, you can use a tool such as nslookup to map from a known name to its IP. You could potentially leverage an SRV record in that instance, but you'd have to go into the interactive mode of nslookup, set the type, and then perform the query - a less intuitive and less well-known approach. b) It's not a matter of managing a named.conf file as much as setting up BIND to support the dynamic update protocol (YARN containers will come up and go down, and those record updates may be relatively frequent). In addition, the stateful complaint has more to do with the need to sync state in multiple processes rather than rely on one source of truth. Finally, the security needs for an internal zone server are finite enough that, if security were the primary driver, the BIND selection would be overkill. c) I'm not familiar with Manta (even initial web searches didn't seem to bring anything up). If there is an open source, available solution I'd be more than happy to evaluate its potential use. d) I'm not sure the problem is necessarily solved. DNS is well understood, obviously. But the use case here - mirroring the details of an existing ZK-based registry or, more accurately, the state of the YARN cluster - presents some requirements that can perhaps best be addressed by a tailored solution. Given the availability of APIs such as dnsjava etc., the approach is not necessarily daunting from a development perspective. As such, testing can be performed to address security and performance concerns, though I'm not naive - I understand some issues will not manifest till actual deployment.
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208925#comment-15208925 ] Allen Wittenauer commented on YARN-4757: I did a quick pass, so I'll need to read more in-depth, but I have some concerns: a) I'm still not sure what value registering A records are here when you can point a SRV record in the fake DNS zone to an existing host in an existing zone using the existing DNS services. This eliminates a ton of corner cases (split zones, NAT, multi-nics, etc) that will need to be covered when registering As. b) The BIND cons are very... odd: * I'm not particularly sure what you find complex about BIND? Most named.conf's aren't complex and rarely change after initial install in my experience. Managing the zone files isn't particularly hard and lots of tools exist in this space for large scale deployments. * You're effectively trading multiple instances of BIND for multiple instances of ZK. * I don't understand the 'stateful' complaint given that, again, you're trading state of BIND for the state stored in ZK. * Better security requirements sounds like a good thing to me... c) Where are the comparisons with other open source DNS solutions? Doesn't Manta already have something exactly like this already? d) The NIH DNS server solution: * "No operational dependencies on elements external to the Hadoop cluster"... Nothing says "thrown over the fence" like "no operational dependencies" when stated by a developer. * it's unknown how well it's going to perform at scale. * no idea how secure it's actually going to be--spoofing, MITM, etc. * admins have zero experience with it vs. pre-existing solutions so will be a knowledge gap. (Never mind the "the software doesn't exist yet so how can someone have experience with it?" problem...) 
* increases the source footprint for what is effectively a solved problem
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208906#comment-15208906 ] Bikas Saha commented on YARN-1040: -- It would be great if existing apps can use the changes in YARN-1040 to be able to run more than a single process (sequentially or concurrently). If we use YARN-1040 to build the primitives here then those primitives could be used for the broader work designed for services (which seems to be indicated in the design doc). Without YARN-1040, existing java based apps cannot use features like increasing container memory because the JVM has to be restarted before it can grow to a larger size. I can see the argument of asking users to use new APIs for new features but requiring existing apps to change their AM/RM implementations (that have been stabilized with much effort) just to be able to launch multiple processes does not seem empathetic. Separately from this, I have not been actively involved in the project for a while. Hence my understanding of the scope and semantic changes proposed in it may be stale and I may be inaccurate in thinking that these are fundamental enough to be done in a special jira for that purpose for a wider discussion. You guys can make a call on that. > De-link container life cycle from an Allocation > --- > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > Attachments: YARN-1040-rough-design.pdf > > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. 
> We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208893#comment-15208893 ] Konstantinos Karanasos commented on YARN-2883: -- Thanks for the feedback, [~chris.douglas] and [~kasha]! I am in the process of addressing Chris' comments -- will upload a new patch soon. Regarding Karthik's comments: bq. Any reason we use a map instead of a queue to store the queued containers? I am using a Map only to track the allocated containers; for the queued containers, I am using a queue, as you suggest. bq. I like that QueuingContainerManagerImpl extends ContainerManagerImpl - while we harden the queuing side of things, it will help keep the code clean. In the longer run, we might want to default to Queuing implementation and play with the queue length, but we can cross that bridge when we get there. Agreed, that was exactly our intention too. bq. IIUC, the intent is to use queueing for all opportunistic containers. The ContainerManagerImpl implementation seems to depend on whether queuing is enabled - wouldn't that affect all containers and not just opportunistic containers? In most cases (including distributed scheduling and resource over-commitment), queues will indeed only be used for opportunistic containers. However, as long as queuing is enabled, guaranteed containers might need to be queued momentarily until the opportunistic containers that block their execution get killed. That's the reason you see guaranteed containers going through the same code-path too. But again, this will not break any semantics of the guaranteed containers. bq. The patch has the author's name left against a TODO. Also, we don't want to leave orphaned TODOs - let us go ahead and file a JIRA True, I will make sure I remove any TODOs and author names. bq. The ResourceUtilization changes are not strictly related to this patch, do they? This is correct. 
I put them in this JIRA because they are just a couple of methods. Do you think I should create a separate JIRA for this? bq. TestQueuingContainerMgr: We typically don't wrap imports at 80 chars. Yep, will fix that. > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-trunk.004.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
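The allocated-map plus queued-queue split described above might look roughly like the following. All names are illustrative; the real QueuingContainerManagerImpl fields and container types differ:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only: queued container requests wait in FIFO order,
// while allocated (running) containers are tracked by id for lookup/kill.
public class NmQueueSketch {
    private final Queue<String> queuedContainers = new ArrayDeque<>();
    private final Map<String, String> allocatedContainers = new HashMap<>();

    public void enqueue(String containerId) {
        queuedContainers.add(containerId);
    }

    // Start the oldest queued container once the node has room; returns the
    // started container id, or null if nothing could be started.
    public String startNextIfRoomAvailable(boolean roomAvailable) {
        if (!roomAvailable || queuedContainers.isEmpty()) {
            return null;
        }
        String id = queuedContainers.poll();
        allocatedContainers.put(id, "RUNNING");
        return id;
    }

    public int queuedCount() {
        return queuedContainers.size();
    }
}
```

The FIFO queue keeps queued requests ordered, while the map gives O(1) access to a running container when a guaranteed container needs an opportunistic one killed.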
[jira] [Updated] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Maron updated YARN-4757: - Attachment: YARN-4757- Simplified discovery of services via DNS mechanisms.pdf I've posted a document providing greater detail concerning this effort. It is intended as a description of the background, a proposed architectural approach, implementation details, and some open issues. I've already had some initial reviews that were of great help in both describing existing points and identifying additional ones. /cc [~vvasudev], [~vinodkv], [~sidharta-s], [~ste...@apache.org], [~elserj]
[jira] [Updated] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated YARN-4436: --- Assignee: Matt LaMantia (was: Devon Michaels) > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4813) TestRMWebServicesDelegationTokenAuthentication.testDoAs fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208734#comment-15208734 ] Daniel Templeton commented on YARN-4813: Nope. Putting this one on the back burner for now. > TestRMWebServicesDelegationTokenAuthentication.testDoAs fails intermittently > > > Key: YARN-4813 > URL: https://issues.apache.org/jira/browse/YARN-4813 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.0 >Reporter: Daniel Templeton > > {noformat}
> -------------------------------------------------------
>  T E S T S
> -------------------------------------------------------
> Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 11.627 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
> testDoAs[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication) Time elapsed: 0.208 sec <<< ERROR!
> java.io.IOException: Server returned HTTP response code: 403 for URL: http://localhost:8088/ws/v1/cluster/delegation-token
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication$3.call(TestRMWebServicesDelegationTokenAuthentication.java:407)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication$3.call(TestRMWebServicesDelegationTokenAuthentication.java:398)
> at org.apache.hadoop.security.authentication.KerberosTestUtils$1.run(KerberosTestUtils.java:120)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.authentication.KerberosTestUtils.doAs(KerberosTestUtils.java:117)
> at org.apache.hadoop.security.authentication.KerberosTestUtils.doAsClient(KerberosTestUtils.java:133)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.getDelegationToken(TestRMWebServicesDelegationTokenAuthentication.java:398)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.testDoAs(TestRMWebServicesDelegationTokenAuthentication.java:357)
> Results :
> Tests in error:
> TestRMWebServicesDelegationTokenAuthentication.testDoAs:357->getDelegationToken:398 » IO
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0
> {noformat}
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208731#comment-15208731 ] Karthik Kambatla commented on YARN-2883: Just skimmed through the patch. Will take a more thorough look once these and Chris' comments are addressed: # Any reason we use a map instead of a queue to store the queued containers? # I like that QueuingContainerManagerImpl extends ContainerManagerImpl - while we harden the queuing side of things, it will help keep the code clean. In the longer run, we might want to default to Queuing implementation and play with the queue length, but we can cross that bridge when we get there. # IIUC, the intent is to use queueing for all opportunistic containers. The ContainerManagerImpl implementation seems to depend on whether queuing is enabled - wouldn't that affect all containers and not just opportunistic containers? # The patch has the author's name left against a TODO. Also, we don't want to leave orphaned TODOs - let us go ahead and file a JIRA # The ResourceUtilization changes are not strictly related to this patch, do they? # If ContainerExecutionEvent is only used by the Queuing implementation, should the class name reflect that? # TestQueuingContainerMgr: We typically don't wrap imports at 80 chars. > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-trunk.004.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. 
> Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4660) o.a.h.yarn.event.TestAsyncDispatcher.testDispatcherOnCloseIfQueueEmpty() swallows YarnExceptions
[ https://issues.apache.org/jira/browse/YARN-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton resolved YARN-4660. Resolution: Invalid > o.a.h.yarn.event.TestAsyncDispatcher.testDispatcherOnCloseIfQueueEmpty() > swallows YarnExceptions > > > Key: YARN-4660 > URL: https://issues.apache.org/jira/browse/YARN-4660 > Project: Hadoop YARN > Issue Type: Improvement > Components: test >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Minor > > Either we expect the exception, or we don't. Quietly swallowing it is the > wrong thing to do in any case. Introduced in YARN-3878. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters
[ https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208674#comment-15208674 ] Junping Du commented on YARN-4820: -- +1. Will commit it shortly if no further comments. > ResourceManager web redirects in HA mode drops query parameters > --- > > Key: YARN-4820 > URL: https://issues.apache.org/jira/browse/YARN-4820 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4820.001.patch, YARN-4820.002.patch, > YARN-4820.003.patch > > > The RMWebAppFilter redirects http requests from the standby to the active. > However it drops all the query parameters when it does the redirect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
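The essence of the fix is to carry the original query string over when building the redirect target. A minimal sketch under assumed names (this is not the actual RMWebAppFilter code, just the bug class it addresses):

```java
import java.net.URI;

// Illustrative only: building the redirect from the path alone drops the
// query string; appending getQuery() when present preserves it.
public class RedirectSketch {
    public static String buildRedirect(String activeRmBase, URI requested) {
        String target = activeRmBase + requested.getPath();
        if (requested.getQuery() != null) {
            target += "?" + requested.getQuery();
        }
        return target;
    }

    public static void main(String[] args) {
        URI req = URI.create("http://standby:8088/ws/v1/cluster/apps?states=RUNNING&limit=10");
        System.out.println(buildRedirect("http://active:8088", req));
        // -> http://active:8088/ws/v1/cluster/apps?states=RUNNING&limit=10
    }
}
```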
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208653#comment-15208653 ] Eric Payne commented on YARN-4686: -- {quote} Hi Eric Badger and Eric Payne, TestMRJobs#testJobWithChangePriority is failing after this issue. Would you fix the test failure? I've filed MAPREDUCE-6658 for fixing the failure. {quote} Thanks, [~ajisakaa] for reporting this. [~ebadger] is looking into this. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Fix For: 2.7.3 > > Attachments: MAPREDUCE-6507.001.patch, > YARN-4686-branch-2.7.006.patch, YARN-4686.001.patch, YARN-4686.002.patch, > YARN-4686.003.patch, YARN-4686.004.patch, YARN-4686.005.patch, > YARN-4686.006.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 sec <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}
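Until start() blocks properly, tests typically work around the race by polling for readiness (e.g. the RM's live-node count, which is exactly what the TestRMNMInfo assertion above races against). A generic sketch of such a wait helper, with the readiness check left to the caller:

```java
import java.util.function.BooleanSupplier;

// Generic poll-until-ready helper of the kind tests use to work around
// MiniYARNCluster.start() returning early. The readiness check itself
// (e.g. "are all 4 NMs registered?") is supplied by the caller.
public class WaitUtil {
    public static boolean waitFor(BooleanSupplier ready, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test would call something like `waitFor(() -> cluster.liveNodeCount() == 4, 30000, 100)` before making assertions (method names hypothetical).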
[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208649#comment-15208649 ] Hudson commented on YARN-4814: -- FAILURE: Integrated in Hadoop-trunk-Commit #9489 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9489/]) YARN-4814. ATS 1.5 timelineclient impl call flush after every event (junping_du: rev af1d125f9ce35ec69a610674a1c5c60cc17141a7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/FileSystemTimelineWriter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java > ATS 1.5 timelineclient impl call flush after every event write > -- > > Key: YARN-4814 > URL: https://issues.apache.org/jira/browse/YARN-4814 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-4814.1.patch, YARN-4814.2.patch > > > ATS 1.5 timelineclient impl call flush after every event write. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token
[ https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208618#comment-15208618 ] Naganarasimha G R commented on YARN-4183: - Hi [~sjlee0] & [~jeagles], shall we conclude on this? Otherwise we may miss it eventually. > Enabling generic application history forces every job to get a timeline > service delegation token > > > Key: YARN-4183 > URL: https://issues.apache.org/jira/browse/YARN-4183 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Naganarasimha G R > Attachments: YARN-4183.1.patch, YARN-4183.v1.001.patch, > YARN-4183.v1.002.patch > > > When enabling just the Generic History Server and not the timeline server, > the system metrics publisher will not publish the events to the timeline > store as it checks if the timeline server and system metrics publisher are > enabled before creating a timeline client. > To make it work, if the timeline service flag is turned on, it will force > every yarn application to get a delegation token. > Instead of checking if timeline service is enabled, we should be checking if > application history server is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4759) Revisit signalContainer() for docker containers
[ https://issues.apache.org/jira/browse/YARN-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208622#comment-15208622 ] Shane Kumpf commented on YARN-4759: --- Also of note, we should propagate the exit code that killed the container up to the end user, so they can verify that exotic signal handling worked appropriately. This can be achieved by retrieving the exit code from the container and subtracting 128 to get the actual signal sent. {code} docker inspect -f '{{.State.ExitCode}}' {code} > Revisit signalContainer() for docker containers > --- > > Key: YARN-4759 > URL: https://issues.apache.org/jira/browse/YARN-4759 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Sidharta Seethana >Assignee: Shane Kumpf > > The current signal handling (in the DockerContainerRuntime) needs to be > revisited for docker containers. For example, container reacquisition on NM > restart might not work, depending on which user the process in the container > runs as. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
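The 128+N arithmetic described above can be sketched as a small helper (a minimal sketch; the function name is hypothetical — the exit code itself would come from `docker inspect -f '{{.State.ExitCode}}'`):

```python
def signal_from_exit_code(exit_code):
    """Recover the terminating signal from a Docker container exit code,
    using the 128+N convention the comment above relies on.

    Returns None when the process exited normally (code <= 128),
    i.e. it was not killed by a signal."""
    return exit_code - 128 if exit_code > 128 else None
```

For example, an exit code of 137 maps to signal 9 (SIGKILL) and 143 maps to signal 15 (SIGTERM).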
[jira] [Commented] (YARN-4759) Revisit signalContainer() for docker containers
[ https://issues.apache.org/jira/browse/YARN-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208603#comment-15208603 ] Shane Kumpf commented on YARN-4759: --- We need to use docker client commands to signal processes in containers, rather than the OS kill command. docker stop sends a SIGTERM to PID 1 and waits 10 seconds for the process to stop (by default; the timeout is configurable); if the container hasn't stopped at the end of the timeout, SIGKILL is sent. docker kill, on the other hand, has no delay and simply sends SIGKILL to PID 1 of the container (by default; the signal is configurable). Signals that invoke graceful shutdown vary between processes. For instance, to gracefully shut down nginx (allowing outstanding requests to finish) SIGQUIT should be sent. For Apache HTTPD, SIGWINCH is used for graceful shutdown. To complicate matters, the docker client sends signals to PID 1 in the container, so depending on whether exec form is used for CMD in the Dockerfile, the process we want to signal may be a subprocess of the shell running as PID 1. Users that require specific signals will need to understand this limitation. We should allow for user-configurable signals and timeouts. There are a couple of approaches to achieve this: 1) Only use docker kill and sleep in Java code. docker kill accepts the --signal argument but does not support a wait timeout. The flow would be: send the signal, then sleep for 10 seconds by default, or for the user-supplied sleep value. 2) Use docker stop if the user has not specified a signal, with the default 10 second timeout or the user-supplied timeout; use docker kill if the user supplies a signal. The default behavior should be to send a SIGTERM, sleep 10 seconds, and, if the container is still running, send SIGKILL. Signals and timeouts should be configurable. How the above impacts NM reacquisition is yet to be determined, but it may make sense to make this an umbrella to split out the required changes. 
/cc [~sidharta-s] - thoughts on the above? > Revisit signalContainer() for docker containers > --- > > Key: YARN-4759 > URL: https://issues.apache.org/jira/browse/YARN-4759 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Sidharta Seethana >Assignee: Shane Kumpf > > The current signal handling (in the DockerContainerRuntime) needs to be > revisited for docker containers. For example, container reacquisition on NM > restart might not work, depending on which user the process in the container > runs as. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
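The signal-then-force-kill flow from approach 1 above could be sketched as follows (a minimal sketch, not the actual NM implementation; the function name, defaults, and injectable `run` parameter are hypothetical and exist only to make the flow explicit):

```python
import subprocess
import time


def stop_container(name, signal="SIGTERM", timeout=10, run=subprocess.run):
    """Sketch of approach 1 above: send the requested signal via
    `docker kill --signal`, wait for the (configurable) timeout, then
    send SIGKILL only if the container is still running.

    `run` is injectable so the flow can be exercised without a Docker
    daemon; in real use it defaults to subprocess.run."""
    # Send the user-supplied (or default) signal to PID 1 of the container.
    run(["docker", "kill", "--signal", signal, name])
    # Give the process time to shut down gracefully.
    time.sleep(timeout)
    # Check whether the container is still running before escalating.
    out = run(["docker", "inspect", "-f", "{{.State.Running}}", name],
              capture_output=True, text=True)
    if getattr(out, "stdout", "").strip() == "true":
        # Still alive after the grace period: force-kill (SIGKILL).
        run(["docker", "kill", name])
```

Note the caveat from the comment above still applies: the signal reaches PID 1, which may be a shell rather than the target process if the Dockerfile's CMD is not in exec form.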
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208594#comment-15208594 ] Varun Vasudev commented on YARN-1040: - Thanks for putting up the proposal [~asuresh]! bq. "ContainerId" becomes "AllocationId" Is AllocationId a new class that we will introduce or a rename of the existing ContainerId class? In either case we have some issues to sort out - the first one won't be backward compatible and in the second case, will the NM generate container ids for the individual containers? bq. An AM can receive only a single allocation on a Node, The Scheduler will "bundle" all Allocations on a Node for an app into a single Large Allocation. Can you explain why we need this restriction? bq. Each Container is tagged with a "ContainerId" which is known only to the AM. Are you referring to the current ContainerId class? If yes, why is it known only to the AM? I actually agree with both Vinod and Bikas. The current approach is a little disruptive and not very useful for existing apps. I think we should separate out the allocations work into its own classes on the RM and the NM, with new APIs added for the RM and the NM. I don't think we can get away with modifying the existing APIs, the one exception being the allocate call, where we can add an additional flag to indicate whether an allocation or a container is desired. Internally, we can change the implementation to have the container model use allocations, but I think allocations will have to have their own state machine with slightly different semantics than containers (on both the RM and NM). 
> De-link container life cycle from an Allocation > --- > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > Attachments: YARN-1040-rough-design.pdf > > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Umesh Prasad updated YARN-4852: --- Description: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Back to Back Full GC kept on happening. GC wasn't able to recover any heap and went OOM. JVM dumped the heap before quitting. We analyzed the heap. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. was: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. GC related settings Settings : -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:InitialTenuringThreshold=1 -XX:+ManagementServer -XX:InitialHeapSize=611042752 -XX:MaxHeapSize=8589934592 -XX:MaxNewSize=348966912 -XX:MaxTenuringThreshold=1 -XX:OldPLABSize=16 -XX:ParallelGCThreads=4 -XX:SurvivorRatio=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseParNewGC Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Back to Back Full GC kept on happening. GC wasn't able to recover any heap and went OOM. JVM dumped the heap before quitting. 
We analyzed the heap. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory. When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
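The OOM above is driven by duplicated completed-container statuses each producing a fresh UpdatedContainerInfo on the nodeUpdateQueue; the safeguard proposed in YARN-4862 is to drop statuses already seen before queuing them. A minimal sketch of that guard (names and dict-based statuses are hypothetical — the real code operates on ContainerStatus objects inside RMNodeImpl):

```python
def dedup_completed_containers(statuses, seen_container_ids):
    """Sketch of the duplicate-status guard discussed above: keep only
    completed-container statuses whose containerId has not been reported
    before, so duplicates never become new UpdatedContainerInfo entries.

    `seen_container_ids` is a mutable set carried across heartbeats."""
    fresh = []
    for status in statuses:
        cid = status["containerId"]
        if cid not in seen_container_ids:
            seen_container_ids.add(cid)
            fresh.append(status)
    return fresh
```

In a real implementation the `seen` set would also need to be pruned (e.g. once the AM acknowledges the completed containers), or it would itself grow without bound.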
[jira] [Commented] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208537#comment-15208537 ] Hadoop QA commented on YARN-4858: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s {color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} shelldocs {color} | {color:blue} 0m 6s {color} | {color:blue} Shelldocs was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 2s {color} | {color:green} branch-2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 37s {color} | {color:green} branch-2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 7s {color} | {color:red} The applied patch generated 2 new + 498 unchanged - 0 fixed = 500 total (was 498) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 43s {color} | {color:green} hadoop-yarn in the patch passed with JDK v1.8.0_74. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 3s {color} | {color:green} hadoop-yarn in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 19m 0s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:babe025 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795001/YARN-4858-branch-2.001.patch | | JIRA Issue | YARN-4858 | | Optional Tests | asflicense mvnsite unit shellcheck shelldocs | | uname | Linux dc7d12adebc6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | branch-2 / 7a3fd1b | | shellcheck | v0.4.3 | | shellcheck | https://builds.apache.org/job/PreCommit-YARN-Build/10857/artifact/patchprocess/diff-patch-shellcheck.txt | | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10857/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn U: hadoop-yarn-project/hadoop-yarn | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10857/console | | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org | This message was automatically generated. 
> start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Umesh Prasad updated YARN-4852: --- Description: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. GC related settings Settings : -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:InitialTenuringThreshold=1 -XX:+ManagementServer -XX:InitialHeapSize=611042752 -XX:MaxHeapSize=8589934592 -XX:MaxNewSize=348966912 -XX:MaxTenuringThreshold=1 -XX:OldPLABSize=16 -XX:ParallelGCThreads=4 -XX:SurvivorRatio=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseParNewGC Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Back to Back Full GC kept on happening. GC wasn't able to recover any heap and went OOM. JVM dumped the heap before quitting. We analyzed the heap. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. was: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digged deep, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Full GC was triggered multiple times when RM went OOM and only 300 MB of heap was released. 
So all these objects look like live objects. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > GC related settings Settings : > > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+CMSParallelRemarkEnabled > -XX:InitialTenuringThreshold=1 > -XX:+ManagementServer > -XX:InitialHeapSize=611042752 > -XX:MaxHeapSize=8589934592 > -XX:MaxNewSize=348966912 > -XX:MaxTenuringThreshold=1 > -XX:OldPLABSize=16 > -XX:ParallelGCThreads=4 > -XX:SurvivorRatio=8 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory. When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-4858: - Attachment: YARN-4858-branch-2.001.patch This patch is for branch-2; trunk will need the same change, reworked for the new bash scripts. The Windows scripts will remain the same. > start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-4858: - Attachment: YARN-4858-001.patch Adds the extra services, ready to be uncommented by anyone who wants them. > start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-4858-001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208499#comment-15208499 ] Yi Zhou commented on YARN-4847: --- I have simulated the negative case successfully. Thanks [~Naganarasimha] for your patience :) ! > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
Steve Loughran created YARN-4858: Summary: start-yarn and stop-yarn scripts to support timeline and sharedcachemanager Key: YARN-4858 URL: https://issues.apache.org/jira/browse/YARN-4858 Project: Hadoop YARN Issue Type: Improvement Components: scripts Affects Versions: 2.8.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor The start-yarn and stop-yarn scripts don't have any (even commented out) support for the timeline and sharedcachemanager Proposed: * bash and cmd start-yarn scripts have commented out start actions * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3816: - Assignee: Li Lu (was: Junping Du) > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Li Lu > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208443#comment-15208443 ] Junping Du commented on YARN-3816: -- Sorry guys. I was planning to finish this a few months ago, but we rebased the code several times and my bandwidth has been quite limited recently. Assigning to Li to follow up on the patch work, as his YARN-3817 depends on this JIRA. > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3959: - Assignee: Varun Saxena (was: Junping Du) > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > > We already have configuration field in HBase schema for application entity. > We need to make sure AM write it out when it get launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format Error
[ https://issues.apache.org/jira/browse/YARN-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208434#comment-15208434 ] Daniel Templeton commented on YARN-4856: I'll look into it. > RM /ws/v1/cluster/scheduler JSON format Error > -- > > Key: YARN-4856 > URL: https://issues.apache.org/jira/browse/YARN-4856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Hadoop-2.7.1 >Reporter: zhangyubiao >Assignee: Daniel Templeton > Labels: patch > > Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error > Root Queue's ChildQueue is > {"memory":3717120,"vCores":1848},"queueName":"root","schedulingPolicy":"fair","childQueues":{color:red}[{"type":"fairSchedulerLeafQueueInfo", > {color}"maxApps":400,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":0,"vCores":0},"maxResources":{"memory":0,"vCores":0}, > But Other's ChildQueue is > {"maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad","schedulingPolicy":"fair","childQueues":{"type":{color:red} > ["fairSchedulerLeafQueueInfo"], > 
{color}"maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad.jd_ad_anti","schedulingPolicy":"fair","numPendingApps":0,"numActiveApps":0},"childQueues":{"type":"fairSchedulerLeafQueueInfo","maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad.jd_ad_formal_1","schedulingPolicy":"fair","numPendingApps":0,"numActiveApps":0},"childQueues":{"type":"fairSchedulerLeafQueueInfo","maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad.jd_ad_oozie","schedulingPolicy":"fair","numPendingApps":0,"numActiveApps":0}},{"maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":0,"vCores":0},"maxResources":{"memory":0,"vCores":0},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":0,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
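The bug reported above is that the RM's `/ws/v1/cluster/scheduler` endpoint serializes `childQueues` inconsistently: sometimes as a JSON array of queue objects, sometimes as a single bare object. Until that is fixed, clients can normalize defensively; a minimal sketch (the function name is hypothetical):

```python
def child_queues(queue):
    """Normalize the inconsistent childQueues field described above:
    the RM sometimes emits a JSON array and sometimes a single object.
    Always return a list of child-queue dicts (empty for leaf queues)."""
    cq = queue.get("childQueues", [])
    # A lone object (dict) is wrapped; an array (list) passes through.
    return cq if isinstance(cq, list) else [cq]
```

This lets a client walk the queue tree uniformly regardless of which shape the server produced.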
[jira] [Commented] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208436#comment-15208436 ] Junping Du commented on YARN-3959: -- Sure. [~varun_saxena], please go ahead. > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > > We already have a configuration field in the HBase schema for the application entity. > We need to make sure the AM writes it out when it gets launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format Error
[ https://issues.apache.org/jira/browse/YARN-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton reassigned YARN-4856: -- Assignee: Daniel Templeton > RM /ws/v1/cluster/scheduler JSON format Error > -- > > Key: YARN-4856 > URL: https://issues.apache.org/jira/browse/YARN-4856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Hadoop-2.7.1 >Reporter: zhangyubiao >Assignee: Daniel Templeton > Labels: patch > > Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4857) Missing default configuration regarding preemption of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208385#comment-15208385 ] Hadoop QA commented on YARN-4857: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | 
{color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 50s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 8s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 17m 10s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12794989/YARN-4857.01.patch | | JIRA Issue | YARN-4857 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit xml | | uname | Linux fef21a1116a6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / a107cee | | Default Java | 1.7.0_95 | | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 | | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10856/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn/h
[jira] [Commented] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208368#comment-15208368 ] Hadoop QA commented on YARN-4849: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 50s {color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} shelldocs {color} | {color:blue} 0m 4s {color} | {color:blue} Shelldocs was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. 
{color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 55s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 12s {color} | {color:green} YARN-3368 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 9s {color} | {color:green} YARN-3368 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 49s {color} | {color:green} YARN-3368 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 9m 14s {color} | {color:green} YARN-3368 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 22s {color} | {color:green} YARN-3368 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 33s {color} | {color:green} YARN-3368 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 9m 27s {color} | {color:green} YARN-3368 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 5m 51s {color} | {color:red} hadoop-yarn in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-ui in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 6m 58s {color} | {color:red} root in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 2m 26s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. 
{color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 2m 26s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 2m 37s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 2m 37s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 32s {color} | {color:red} root in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvneclipse {color} | {color:red} 0m 39s {color} | {color:red} root in the patch failed. {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 13s {color} | {color:red} The applied patch generated 552 new + 98 unchanged - 0 fixed = 650 total (was 98) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 50 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s {color} | {color:red} The patch has 235 line(s) with tabs. {color} | | {color:red}-1{color} | {color:red} xml {color} | {color:red} 0m 2s {color} | {color:red} The patch has 1 ill-formed XML file(s). {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 2m 43s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 3m 41s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 15s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 47s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. 
{color} | | {color:red}-1{color} | {color:red} asflicense {color} | {
[jira] [Updated] (YARN-4857) Missing default configuration regarding preemption of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated YARN-4857: - Attachment: YARN-4857.01.patch > Missing default configuration regarding preemption of CapacityScheduler > --- > > Key: YARN-4857 > URL: https://issues.apache.org/jira/browse/YARN-4857 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, documentation >Reporter: Kai Sasaki >Assignee: Kai Sasaki >Priority: Minor > Labels: documentation > Attachments: YARN-4857.01.patch > > > The {{yarn.resourcemanager.monitor.*}} configurations are missing from > yarn-default.xml. Since they were documented explicitly by YARN-4492, > yarn-default.xml can be updated to match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4857) Missing default configuration regarding preemption of CapacityScheduler
Kai Sasaki created YARN-4857: Summary: Missing default configuration regarding preemption of CapacityScheduler Key: YARN-4857 URL: https://issues.apache.org/jira/browse/YARN-4857 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, documentation Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor The {{yarn.resourcemanager.monitor.*}} configurations are missing from yarn-default.xml. Since they were documented explicitly by YARN-4492, yarn-default.xml can be updated to match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
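For illustration, the entries such a patch would add to yarn-default.xml might look like the following sketch. The property names and defaults below are taken from the CapacityScheduler preemption documentation, not from the attached patch, so verify them against the target Hadoop release:

```xml
<!-- Sketch of the missing yarn-default.xml entries; defaults are the
     documented ones and should be checked against the target release. -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval</name>
  <value>3000</value>
</property>
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>15000</value>
</property>
```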
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208314#comment-15208314 ] Rohith Sharma K S commented on YARN-4852: - bq. By the way what is the heartbeat interval from AM to RM in which it will acquire the CS lock. The MRAppMaster heartbeat interval is 1 second by default, and the CS lock is acquired only when the heartbeat carries a resource ask. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Full GC was triggered multiple times when the RM went OOM and only 300 MB of heap > was released, so all these objects look like live objects. > The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 > minutes and went OOM. > There was no spike in job submissions or container numbers at the time the issue > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
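For reference, the 1-second default mentioned above corresponds to the MR AM's allocator heartbeat setting. A sketch of overriding it in mapred-site.xml (the property name is from mapred-default.xml in Hadoop 2.x; confirm it for your version before tuning):

```xml
<!-- MRAppMaster -> RM allocator heartbeat interval. Default is 1000 ms;
     raising it reduces how often each AM contends for the CS lock. -->
<property>
  <name>yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms</name>
  <value>1000</value>
</property>
```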
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208301#comment-15208301 ] Gokul commented on YARN-4852: - Thanks [~rohithsharma], this gives some perspective on the starvation of the scheduler event processor thread. Maybe YARN-3487 would bring down the probability of this issue. It took more than 30 minutes for the heap to double and go OOM, so the scheduler event processor must have gotten to process at least some nodeUpdate events; yet the heap kept growing and never came down. That's why I am not fully convinced that YARN-3487 would solve the issue. By the way, what is the heartbeat interval from the AM to the RM in which it acquires the CS lock? > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208283#comment-15208283 ] Rohith Sharma K S commented on YARN-4852: - I was explaining how the absence of YARN-3487 might cause the OOM. There could be other reasons causing the nodeUpdate queue to pile up, which need to be analysed. To rule YARN-3487 out as a suspect, apply the patch in the cluster; if the issue occurs again, it will be easier to focus on a particular area. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208246#comment-15208246 ] Rohith Sharma K S commented on YARN-4852: - To be clearer: *Flow-1*: Each AM heartbeat or application submission tries to acquire the CS lock. In your cluster, 93 concurrently running apps send resource requests to the RM in their AM heartbeats, so that many AM heartbeats race to obtain the CS lock. *Flow-2*: On the other hand, the scheduler event processing thread dispatches events one by one, so at any point in time only one nodeUpdate event is being processed. This nodeUpdate event also tries to acquire the CS lock and joins the same race (from your thread dump, nodeUpdate has acquired the CS lock, as I mentioned in my previous comment). Consider the worst case where the AM heartbeats always win the CS lock: nodeUpdate calls are delayed, and since the scheduler event processor handles events one by one, the other node update events pile up. Note that the scheduler node status event is triggered from RMNodeImpl, and a delay in scheduler event processing does not block NodeManager heartbeats, so the NodeManagers keep sending node heartbeats and appending to RMNodeImpl#nodeUpdateQueue. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
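The two flows described above amount to a classic unbounded producer/consumer imbalance. A toy Python sketch (names mirror the RM code, but the rates are purely illustrative, not measured from this cluster) of why the queue retains so many live objects when the consumer is lock-starved:

```python
from collections import deque

# Toy model of RMNodeImpl#nodeUpdateQueue: NM heartbeats enqueue container
# updates unconditionally, while the single scheduler event thread - starved
# by AM heartbeats competing for the CapacityScheduler lock - drains fewer.
node_update_queue = deque()   # unbounded, like the queue inside RMNodeImpl

ENQUEUES_PER_TICK = 1200      # e.g. one update per heartbeat across 1200 nodes
DEQUEUES_PER_TICK = 300       # starved consumer keeps up with only a fraction

for tick in range(100):
    for _ in range(ENQUEUES_PER_TICK):
        node_update_queue.append(object())   # stands in for UpdatedContainerInfo
    for _ in range(min(DEQUEUES_PER_TICK, len(node_update_queue))):
        node_update_queue.popleft()

# Backlog grows by 900 per tick, and every queued object is reachable,
# which matches the "live objects" observation from the heap dump.
print(len(node_update_queue))  # 90000
```

Because every queued UpdatedContainerInfo stays reachable from RMNodeImpl, full GC cannot reclaim the backlog; only draining (or bounding/deduplicating, as YARN-4862 proposes) shrinks it.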
[jira] [Commented] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208224#comment-15208224 ] Hadoop QA commented on YARN-4849: - (!) A patch to the testing environment has been detected. Re-executing against the patched versions to perform further tests. The console is at https://builds.apache.org/job/PreCommit-YARN-Build/10855/console in case of problems. > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4849-YARN-3368.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format Error
[ https://issues.apache.org/jira/browse/YARN-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangyubiao updated YARN-4856: -- Summary: RM /ws/v1/cluster/scheduler JSON format Error (was: RM /ws/v1/cluster/scheduler JSON format err ) > RM /ws/v1/cluster/scheduler JSON format Error > -- > > Key: YARN-4856 > URL: https://issues.apache.org/jira/browse/YARN-4856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Hadoop-2.7.1 >Reporter: zhangyubiao > Labels: patch > > Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format err
zhangyubiao created YARN-4856: - Summary: RM /ws/v1/cluster/scheduler JSON format err Key: YARN-4856 URL: https://issues.apache.org/jira/browse/YARN-4856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.1 Environment: Hadoop-2.7.1 Reporter: zhangyubiao Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208216#comment-15208216 ] Hadoop QA commented on YARN-3863: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 11m 23s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 42s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} YARN-2928 passed with 
JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 18s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice: patch generated 5 new + 4 unchanged - 1 fixed = 9 total (was 5) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 45s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_74. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 36s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 28m 54s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12794930/YARN-3863-YARN-2928.v2.05.patch | | JIRA Issue | YARN-3863 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux ccbcd85a77e6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | |
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208204#comment-15208204 ] Gokul commented on YARN-4852: - Agreed, 7 threads are waiting to lock CapacityScheduler.getQueueInfo. What is the impact if this many threads are waiting on this lock during the application submission phase? Could that be the cause of RMNodeImpl.nodeUpdateQueue piling up? If yes, then YARN-3487 will fix the issue. Otherwise there should be some other reason, such as the consumer of the queue (RMNodeImpl.nodeUpdateQueue), the ResourceManager event processor thread, being stuck on something so that it does not drain the queue. Also, the thread doing nodeUpdate (the ResourceManager event processor) is not in a blocked state; it is still runnable. There are around 1200 NMs in the cluster. 93 apps were running when the issue occurred. The number of containers allocated was 17803 and pending was 63422. The job submission rate was roughly 6 per minute. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut > itself down. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of the memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects each of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Full GC was triggered multiple times when the RM went OOM and only 300 MB of heap > was released, so all these objects look like live objects. > The RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB within 20 > minutes and went OOM.
> There was no spike in job submissions or container numbers at the time the issue > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
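The pile-up described above is what YARN-4862 proposes to guard against. A minimal sketch of such a safeguard (the class and method names here are illustrative, not the actual RMNodeImpl code) would remember already-reported completed container IDs and drop duplicates before any UpdatedContainerInfo is created:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: deduplicate completed-container reports before
// queueing them, so repeated NM status reports cannot pile up on the heap.
public class CompletedContainerDedup {
  private final Set<String> reportedCompleted = new HashSet<>();

  /** Returns only the container ids not already reported as completed. */
  public List<String> filterDuplicates(List<String> completedIds) {
    List<String> fresh = new ArrayList<>();
    for (String id : completedIds) {
      // Set.add returns false when the id was already present.
      if (reportedCompleted.add(id)) {
        fresh.add(id);
      }
    }
    return fresh;
  }
}
```

In the real RMNodeImpl the remembered set would also need pruning (for example, once the completion has been acknowledged) so that it does not itself grow without bound.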
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208172#comment-15208172 ] Rohith Sharma K S commented on YARN-4852: - Looking at your attached thread dump, I feel the root cause of your issue is YARN-3487. Maybe you can check whether it recurs regularly. From the thread dump, I see that 8 threads are waiting for the CS lock, out of which 7 are in {{CapacityScheduler.getQueueInfo}}, which is called when validating a resource request, either during application submission (for the AM resource request) or on an AM heartbeat request. At that time, nodeUpdate is holding the CS lock. This can take a few milliseconds to process the container statuses if many are present. {code} at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1190) - locked <0x0005d4cfe5c8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:951) - locked <0x0005d4cfe5c8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) {code} What can happen in a larger cluster is that if many ApplicationMasters are running concurrently and the application submission rate is very high, nodeUpdate will be blocked for a significant time trying to obtain the CS lock. The reason for the blocking is YARN-3487. So the more NodeManagers there are, the longer it takes to process each node update, which internally piles up container statuses and might be causing the OOM. Just for info: how many NodeManagers are in the cluster? How many AMs run concurrently, and how many tasks per job? What is the job submission rate?
> Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut > itself down. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of the memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects each of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Full GC was triggered multiple times when the RM went OOM and only 300 MB of heap > was released, so all these objects look like live objects. > The RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB within 20 > minutes and went OOM. > There was no spike in job submissions or container numbers at the time the issue > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
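The contention pattern in the thread dump can be shown in miniature. In this sketch (illustrative only, not the actual CapacityScheduler code) both methods synchronize on the same scheduler monitor, so every getQueueInfo caller stalls for as long as a nodeUpdate holds the lock:

```java
// Illustrative sketch only: one coarse monitor shared by nodeUpdate and
// getQueueInfo, mirroring the CS-lock contention seen in the thread dump.
public class CoarseLockScheduler {
  public synchronized void nodeUpdate(int containerStatuses)
      throws InterruptedException {
    // Pretend each container status costs ~1 ms to process under the lock.
    Thread.sleep(containerStatuses);
  }

  public synchronized String getQueueInfo(String queue) {
    // Blocked until any in-flight nodeUpdate releases the monitor.
    return "info:" + queue;
  }
}
```

With many NMs heartbeating and many AMs submitting, the time spent inside nodeUpdate under the lock directly becomes queueing delay for every submission validating its resource request; the comment above points to YARN-3487 as the fix for this contention.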
[jira] [Created] (YARN-4855) Should check if node exists when replace nodelabels
Tao Jie created YARN-4855: - Summary: Should check if node exists when replace nodelabels Key: YARN-4855 URL: https://issues.apache.org/jira/browse/YARN-4855 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Tao Jie Priority: Minor Today when we add node labels to nodes, the operation succeeds without any message even if the nodes are not existing NodeManagers in the cluster. It could be like this: when we use *yarn rmadmin -replaceLabelsOnNode "node1=label1"*, the request would be denied if the node does not exist; when we use *yarn rmadmin -replaceLabelsOnNode -force "node1=label1"*, the node labels would be added no matter whether the node exists. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
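The proposed behaviour could be sketched as follows (the class and method names are hypothetical, for illustration only): a replacement request for a node that is not a known NodeManager is rejected unless the force flag is given:

```java
import java.util.Set;

// Illustrative sketch only: reject replaceLabelsOnNode for nodes that are
// not known NodeManagers unless the caller passed -force.
public class NodeLabelReplaceCheck {
  private final Set<String> activeNodes;

  public NodeLabelReplaceCheck(Set<String> activeNodes) {
    this.activeNodes = activeNodes;
  }

  /** Returns true if the label replacement should be allowed to proceed. */
  public boolean canReplace(String node, boolean force) {
    return force || activeNodes.contains(node);
  }
}
```

The force path preserves today's behaviour (useful for pre-labelling nodes before they register), while the default path gives the operator the missing error message.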
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208103#comment-15208103 ] Akira AJISAKA commented on YARN-4686: - Hi [~ebadger] and [~eepayne], TestMRJobs#testJobWithChangePriority is failing after this issue. Would you fix the test failure? I've filed MAPREDUCE-6658 for fixing the failure. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Fix For: 2.7.3 > > Attachments: MAPREDUCE-6507.001.patch, > YARN-4686-branch-2.7.006.patch, YARN-4686.001.patch, YARN-4686.002.patch, > YARN-4686.003.patch, YARN-4686.004.patch, YARN-4686.005.patch, > YARN-4686.006.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208093#comment-15208093 ] Yi Zhou commented on YARN-4847: --- Thanks! I will double check this. > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208090#comment-15208090 ] Naganarasimha G R commented on YARN-4847: - I tested in 2.6.4 > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208084#comment-15208084 ] Naganarasimha G R commented on YARN-4847: - Hi [~jameszhouyi], I tried testing with your configuration and I was able to see the exception being thrown: {code} 16/03/23 14:20:14 FATAL distributedshell.Client: Error running Client org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, queue=m doesn't have permission to access all labels in resource request. labelExpression of resource request=y. Queue labels=* at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:289) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) {code} > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208021#comment-15208021 ] Yi Zhou commented on YARN-4847: --- Hi [~Naganarasimha] Thank you for your great help! OK, if it is related to the doc I will raise it here; I will post my issues to the mailing list. > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207998#comment-15207998 ] Varun Saxena commented on YARN-3959: [~djp], I can work on this issue if you are not planning to work on it in the short term, as it is marked for the 1st milestone. Do let me know. > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > > We already have a configuration field in the HBase schema for the application entity. > We need to make sure the AM writes it out when it gets launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207994#comment-15207994 ] Naganarasimha G R commented on YARN-4847: - Let me check your issue. Also, as [~wangda] and [~sunilg] were mentioning, it would be better to capture usability issues in the forums rather than here. The main intention of this jira is to capture the documentation, and anything required for that we can discuss in this jira. As part of this documentation I would focus more on what is supported in 2.6.x node labels than on what is not supported in 2.6.x compared with 2.7.x or later. > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207995#comment-15207995 ] Varun Saxena commented on YARN-3816: [~sjlee0], Maybe in Thursday's meeting we can revisit the open 1st milestone JIRAs and check whether assignees have the bandwidth. If Junping does not have bandwidth, I can pitch in on a couple of his open JIRAs too. > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters
[ https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207993#comment-15207993 ] Varun Vasudev commented on YARN-4820: - The test failures are unrelated to the patch. > ResourceManager web redirects in HA mode drops query parameters > --- > > Key: YARN-4820 > URL: https://issues.apache.org/jira/browse/YARN-4820 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4820.001.patch, YARN-4820.002.patch, > YARN-4820.003.patch > > > The RMWebAppFilter redirects http requests from the standby to the active. > However it drops all the query parameters when it does the redirect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4849: - Attachment: YARN-4849-YARN-3368.1.patch > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4849-YARN-3368.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4849: - Attachment: (was: YARN-4849.1.patch) > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4285) Display resource usage as percentage of queue and cluster in the RM UI
[ https://issues.apache.org/jira/browse/YARN-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207991#comment-15207991 ] Varun Vasudev commented on YARN-4285: - [~jianhe] - it makes sense to remove reserved resources from the used resources, but do we know why we counted reserved resources as part of used resources in the first place? > Display resource usage as percentage of queue and cluster in the RM UI > -- > > Key: YARN-4285 > URL: https://issues.apache.org/jira/browse/YARN-4285 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: YARN-4285.001.patch, YARN-4285.002.patch, > YARN-4285.003.patch, YARN-4285.004.patch > > > Currently, we display the memory and vcores allocated to an app in the RM UI. > It would be useful to display the resources consumed as a %of the queue and > the cluster to identify apps that are using a lot of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)