[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267224#comment-16267224 ]

Hudson commented on YARN-6168:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13279 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13279/])
YARN-6168. Restarted RM may not inform AM about all existing containers. (jianhe: rev fedabcad42067ac7dd24de40fab6be2d3485a540)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/Allocation.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/DefaultAMSProcessor.java

> Restarted RM may not inform AM about all existing containers
> ------------------------------------------------------------
>
>                 Key: YARN-6168
>                 URL: https://issues.apache.org/jira/browse/YARN-6168
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Billie Rinaldi
>            Assignee: Chandni Singh
>             Fix For: 3.1.0
>
>         Attachments: YARN-6168.001.patch, YARN-6168.002.patch,
> YARN-6168.003.patch, YARN-6168.004.patch
>
>
> There appears to be a race condition when an RM is restarted. I had a
> situation where the RMs and AM were down, but NMs and app containers were
> still running. When I restarted the RM, the AM restarted, registered with the
> RM, and received its list of existing containers before the NMs had reported
> all of their containers to the RM. The AM was only told about some of the
> app's existing containers.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267191#comment-16267191 ]

Chandni Singh commented on YARN-6168:
-------------------------------------

Thanks [~jianhe]
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267181#comment-16267181 ]

Jian He commented on YARN-6168:
-------------------------------

I committed this into trunk. Thanks [~csingh] !
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267173#comment-16267173 ]

Chandni Singh commented on YARN-6168:
-------------------------------------

The test and Findbugs failures are not related to the change.
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261794#comment-16261794 ]

Hadoop QA commented on YARN-6168:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 21s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 9s | Maven dependency ordering for branch |
| +1 | mvninstall | 14m 49s | trunk passed |
| +1 | compile | 7m 24s | trunk passed |
| +1 | checkstyle | 0m 58s | trunk passed |
| +1 | mvnsite | 2m 8s | trunk passed |
| +1 | shadedclient | 12m 40s | branch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 1m 9s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in trunk has 1 extant Findbugs warnings. |
| +1 | javadoc | 1m 45s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 10s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 39s | the patch passed |
| +1 | compile | 6m 40s | the patch passed |
| +1 | cc | 6m 40s | the patch passed |
| +1 | javac | 6m 40s | the patch passed |
| +1 | checkstyle | 0m 56s | the patch passed |
| +1 | mvnsite | 2m 3s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 7s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 52s | the patch passed |
| +1 | javadoc | 1m 55s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 43s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 3m 13s | hadoop-yarn-common in the patch passed. |
| -1 | unit | 60m 48s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 135m 19s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | YARN-6168 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12898748/YARN-6168.004.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle cc |
| uname | Linux 0f83cab7f6a5
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261678#comment-16261678 ]

Hadoop QA commented on YARN-6168:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 10s | Maven dependency ordering for branch |
| +1 | mvninstall | 14m 58s | trunk passed |
| +1 | compile | 8m 15s | trunk passed |
| +1 | checkstyle | 0m 52s | trunk passed |
| +1 | mvnsite | 1m 53s | trunk passed |
| +1 | shadedclient | 11m 46s | branch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 1m 6s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in trunk has 1 extant Findbugs warnings. |
| +1 | javadoc | 1m 48s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 9s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 41s | the patch passed |
| +1 | compile | 7m 16s | the patch passed |
| +1 | cc | 7m 16s | the patch passed |
| +1 | javac | 7m 16s | the patch passed |
| +1 | checkstyle | 0m 59s | the patch passed |
| +1 | mvnsite | 2m 2s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 17s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 57s | the patch passed |
| +1 | javadoc | 1m 37s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 37s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 3m 9s | hadoop-yarn-common in the patch passed. |
| -1 | unit | 61m 13s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 33s | The patch does not generate ASF License warnings. |
| | | 136m 4s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | YARN-6168 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12898730/YARN-6168.003.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261552#comment-16261552 ]

Jian He commented on YARN-6168:
-------------------------------

A minor optimization for pullPreviousAttemptContainers: we could add a check that returns early if the size == 0.

Could you also add a short comment to each method:
- for pullContainersToTransfer: // called when AM registers
- for pullPreviousAttemptContainers: // called when AM heartbeats if there are containers not reported in register.
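The early-return check and the requested comment could look roughly like the sketch below. It is not the committed patch: the class `AttemptContainerTracker`, the helper `addRecovered`, and the use of plain `String` container IDs are assumptions made for illustration. Only the names `pullPreviousAttemptContainers` and `recoveredPreviousAttemptContainers` come from the review thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the scheduler-side attempt state; not the Hadoop class. */
class AttemptContainerTracker {
  // Containers of previous attempts that were recovered after the AM registered.
  // Plain String IDs are used here only to keep the sketch self-contained.
  private final List<String> recoveredPreviousAttemptContainers = new ArrayList<>();

  /** Hypothetical hook: an NM reported a previous-attempt container late. */
  synchronized void addRecovered(String containerId) {
    recoveredPreviousAttemptContainers.add(containerId);
  }

  // Called when AM heartbeats if there are containers not reported in register.
  synchronized List<String> pullPreviousAttemptContainers() {
    if (recoveredPreviousAttemptContainers.isEmpty()) {
      // Fast path suggested in the review: nothing to hand over, so skip the copy.
      return Collections.emptyList();
    }
    List<String> pulled = new ArrayList<>(recoveredPreviousAttemptContainers);
    recoveredPreviousAttemptContainers.clear();
    return pulled;
  }
}
```

Most heartbeats will find the list empty, so the check avoids allocating a new list on the common path.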
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260204#comment-16260204 ]

Hadoop QA commented on YARN-6168:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 23s | Maven dependency ordering for branch |
| +1 | mvninstall | 14m 51s | trunk passed |
| +1 | compile | 7m 55s | trunk passed |
| +1 | checkstyle | 1m 1s | trunk passed |
| +1 | mvnsite | 1m 57s | trunk passed |
| +1 | shadedclient | 12m 5s | branch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 1m 7s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in trunk has 1 extant Findbugs warnings. |
| -1 | findbugs | 1m 11s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 1 extant Findbugs warnings. |
| +1 | javadoc | 1m 42s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 10s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 39s | the patch passed |
| +1 | compile | 7m 11s | the patch passed |
| +1 | cc | 7m 11s | the patch passed |
| +1 | javac | 7m 11s | the patch passed |
| -0 | checkstyle | 0m 56s | hadoop-yarn-project/hadoop-yarn: The patch generated 1 new + 208 unchanged - 0 fixed = 209 total (was 208) |
| +1 | mvnsite | 2m 0s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 9m 35s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 50s | the patch passed |
| +1 | javadoc | 1m 38s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 42s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 3m 14s | hadoop-yarn-common in the patch passed. |
| -1 | unit | 60m 54s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 33s | The patch does not generate ASF License warnings. |
| | | 134m 51s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259960#comment-16259960 ]

Jian He commented on YARN-6168:
-------------------------------

Could you also add more detailed comments in AllocateResponse#get/setContainersFromPreviousAttempts to explain the scenario in which the containers might not have been received in the previous register call?
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259957#comment-16259957 ]

Jian He commented on YARN-6168:
-------------------------------

- Does AllocateResponsePBImpl#mergeLocalToBuilder need some changes too?
- recoveredPreviousAttemptContainers could be of type Container, so that pullPreviousAttemptContainers doesn't need to transform RMContainer to Container.
- I think getLiveContainers and clearPreviousContainers need to be in the same synchronization block. Otherwise, it is possible to lose previous containers, for example:
1. AM acquires the live containers on register
2. containers are added to the live containers and the previous containers
3. the previous containers are cleared
{code}
Collection liveContainers = app.getCurrentAppAttempt().getLiveContainers();
app.getCurrentAppAttempt().resetPreviousAttemptContainers();
{code}
- Could you add comments in the header of testContainersFromPreviousAttemptsWithRMRestart to explain what the test does, so that others don't need to dig into the code to understand it?
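One way to close the window described in the third point is to take the snapshot of live containers and reset the previous-attempt list under the same lock. The following is a minimal sketch of that idea, not the code from the patch: `AttemptState`, `recoverContainer`, and `snapshotLiveAndResetPrevious` are hypothetical names, and containers are represented by `String` IDs for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of one application attempt's container bookkeeping. */
class AttemptState {
  private final Set<String> liveContainers = new LinkedHashSet<>();
  private final List<String> previousAttemptContainers = new ArrayList<>();

  /** An NM reports a container that belongs to a previous attempt. */
  synchronized void recoverContainer(String id) {
    liveContainers.add(id);
    previousAttemptContainers.add(id);
  }

  /**
   * Register path: snapshot and reset happen in one critical section, so a
   * container recovered concurrently is either in the returned snapshot or
   * still queued for the next heartbeat, never dropped in between.
   */
  synchronized List<String> snapshotLiveAndResetPrevious() {
    List<String> snapshot = new ArrayList<>(liveContainers);
    previousAttemptContainers.clear();
    return snapshot;
  }

  /** Heartbeat path: hand over containers recovered since the last pull. */
  synchronized List<String> pullPreviousAttemptContainers() {
    List<String> pulled = new ArrayList<>(previousAttemptContainers);
    previousAttemptContainers.clear();
    return pulled;
  }
}
```

With the two calls in separate critical sections, as in the quoted snippet, a `recoverContainer` call could run between them and its entry would be cleared without ever having been in the snapshot.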
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259697#comment-16259697 ]

Chandni Singh commented on YARN-6168:
-------------------------------------

[~billie.rinaldi] [~jianhe] Can you please review
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256357#comment-16256357 ]

Hadoop QA commented on YARN-6168:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 19s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 16s | Maven dependency ordering for branch |
| +1 | mvninstall | 22m 39s | trunk passed |
| +1 | compile | 9m 11s | trunk passed |
| +1 | checkstyle | 1m 9s | trunk passed |
| +1 | mvnsite | 2m 36s | trunk passed |
| +1 | shadedclient | 13m 15s | branch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 1m 26s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in trunk has 1 extant Findbugs warnings. |
| -1 | findbugs | 1m 29s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 1 extant Findbugs warnings. |
| +1 | javadoc | 1m 52s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 12s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 16s | the patch passed |
| +1 | compile | 7m 26s | the patch passed |
| +1 | cc | 7m 26s | the patch passed |
| +1 | javac | 7m 26s | the patch passed |
| -0 | checkstyle | 1m 3s | hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 209 unchanged - 0 fixed = 211 total (was 209) |
| +1 | mvnsite | 2m 25s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 21s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 48s | the patch passed |
| +1 | javadoc | 1m 41s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 39s | hadoop-yarn-api in the patch passed. |
| -1 | unit | 2m 52s | hadoop-yarn-common in the patch failed. |
| -1 | unit | 55m 35s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 144m 39s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.api.TestPBImplRecords |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |

|| Subsystem || Report/Notes
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246744#comment-16246744 ] Chandni Singh commented on YARN-6168:
The default value of {{nmExpiryInterval}} is 10 minutes. That is too long for apps to wait before recovering, and no app setting can influence it. So, I prefer the solution proposed by [~jianhe]. Please let me know your thoughts.
> Restarted RM may not inform AM about all existing containers
>
> Key: YARN-6168
> URL: https://issues.apache.org/jira/browse/YARN-6168
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Billie Rinaldi
> Assignee: Chandni Singh
>
> There appears to be a race condition when an RM is restarted. I had a
> situation where the RMs and AM were down, but NMs and app containers were
> still running. When I restarted the RM, the AM restarted, registered with the
> RM, and received its list of existing containers before the NMs had reported
> all of their containers to the RM. The AM was only told about some of the
> app's existing containers.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146075#comment-16146075 ] Jian He commented on YARN-6168:
One way would probably be to change the AM heartbeat to also return previously running containers; right now they are only returned in the registerApplicationMaster response. We could even deprecate the old one and have only one place (the AM heartbeat response) that returns the old containers.
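Jian He's suggestion can be sketched from the AM side roughly as follows. This is a minimal, self-contained illustration and not the code of any patch: `PreservedContainerTracker`, `onRegister`, and `onHeartbeat` are hypothetical names, and container IDs are plain strings instead of YARN container objects. The assumption is that the AM treats the list it gets at registration as provisional and merges in any previously running containers that later heartbeat responses report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical AM-side bookkeeping for containers that survived a restart. */
class PreservedContainerTracker {
    // Insertion-ordered set of every surviving container the AM knows about.
    private final Set<String> known = new LinkedHashSet<>();

    /** Record the containers the RM already knew about at registration time. */
    void onRegister(Collection<String> fromRegisterResponse) {
        known.addAll(fromRegisterResponse);
    }

    /**
     * Merge containers reported in a heartbeat response.
     * Returns only the ones the AM had not seen before.
     */
    List<String> onHeartbeat(Collection<String> fromHeartbeatResponse) {
        List<String> newlyRecovered = new ArrayList<>();
        for (String id : fromHeartbeatResponse) {
            if (known.add(id)) {
                newlyRecovered.add(id);
            }
        }
        return newlyRecovered;
    }

    /** Read-only view of all surviving containers seen so far. */
    Set<String> knownContainers() {
        return Collections.unmodifiableSet(known);
    }
}

class PreservedContainerDemo {
    public static void main(String[] args) {
        PreservedContainerTracker tracker = new PreservedContainerTracker();
        // At registration, only some NMs have reported to the restarted RM.
        tracker.onRegister(Arrays.asList("container_01", "container_02"));
        // A late NM report surfaces one more container in a later heartbeat.
        List<String> late =
            tracker.onHeartbeat(Arrays.asList("container_02", "container_03"));
        System.out.println("recovered late: " + late);
        System.out.println("all known: " + tracker.knownContainers());
    }
}
```

With bookkeeping like this, a container that surfaces late can be adopted when it arrives instead of being treated as unknown.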
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947326#comment-15947326 ] Jason Lowe commented on YARN-6168:
This sounds like the RM isn't waiting long enough for all the live NMs to report in before reporting the live containers to the app. Technically it would have to wait up to the full NM expiry interval before it could know for sure that no more containers are going to be reported by late-heartbeating NMs, so one fix would be to hold off AM restarts of container-preserving apps after an RM restart until the NM expiry interval has passed since the restart. However, I don't know if apps are willing to wait that long before their AM recovers. If not, then there is always going to be the possibility that not all live containers are reported when the AM restarts and registers, if an NM ends up heartbeating late.
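The cost of the hold-off Jason Lowe describes is simple arithmetic, sketched below. This is only an illustration under stated assumptions: `AmRestartGate` and its methods are hypothetical names, not RM code, and the 10-minute interval is the default value cited in this thread. The gate allows a container-preserving AM to restart only once a full NM expiry interval has elapsed since the RM restart.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: when may a container-preserving AM be restarted? */
class AmRestartGate {
    private final Instant rmStart;
    private final Duration nmExpiryInterval;

    AmRestartGate(Instant rmStart, Duration nmExpiryInterval) {
        this.rmStart = rmStart;
        this.nmExpiryInterval = nmExpiryInterval;
    }

    /** Earliest time at which every live NM must have reported in. */
    Instant earliestSafeAmStart() {
        return rmStart.plus(nmExpiryInterval);
    }

    /** How much longer the AM restart would be held off, never negative. */
    Duration remainingWait(Instant now) {
        Duration left = Duration.between(now, earliestSafeAmStart());
        return left.isNegative() ? Duration.ZERO : left;
    }
}

class AmRestartGateDemo {
    public static void main(String[] args) {
        // Assumed RM restart time, chosen only for the example.
        Instant rmStart = Instant.parse("2017-03-29T12:00:00Z");
        AmRestartGate gate = new AmRestartGate(rmStart, Duration.ofMinutes(10));

        // Three minutes after the RM restart, seven minutes of waiting remain.
        System.out.println(gate.remainingWait(rmStart.plusSeconds(180)));
        // Fifteen minutes after the restart, the AM may start immediately.
        System.out.println(gate.remainingWait(rmStart.plusSeconds(900)));
    }
}
```

In the worst case the AM is delayed by the whole interval, which is the wait that the comment doubts apps will accept.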
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947305#comment-15947305 ] Billie Rinaldi commented on YARN-6168:
Yes. In my case, the AM requested new containers immediately and got them allocated, so when it was informed later about the old containers, it just released them.
[jira] [Commented] (YARN-6168) Restarted RM may not inform AM about all existing containers
[ https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946721#comment-15946721 ] Shen Yinjie commented on YARN-6168:
This case happened a few times after a restart. After more heartbeats, the AM is informed of the actual live containers.