[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524587#comment-16524587 ] genericqa commented on YARN-8459:
-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 26s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 24m 3s | trunk passed |
| +1 | compile | 0m 40s | trunk passed |
| +1 | checkstyle | 0m 14s | trunk passed |
| +1 | mvnsite | 0m 43s | trunk passed |
| +1 | shadedclient | 10m 57s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 6s | trunk passed |
| +1 | javadoc | 0m 26s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 41s | the patch passed |
| +1 | compile | 0m 37s | the patch passed |
| +1 | javac | 0m 37s | the patch passed |
| +1 | checkstyle | 0m 9s | the patch passed |
| +1 | mvnsite | 0m 39s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 6s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 10s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 68m 37s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
| | | 122m 28s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8459 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929311/YARN-8459.002.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux bc6b5718491b 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / bedc4fe |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21125/testReport/ |
| Max. process+thread count | 950 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21125/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT |
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524476#comment-16524476 ] Weiwei Yang commented on YARN-8459:
Hi [~leftnoteasy], [~sunilg]
bq. c. Reserve on node operation finished after app_1 removed (doneApplicationAttempt).
Do we have a check in the commit phase to make sure a reservation can only be made for a valid app and on a valid node? Allocate may produce invalid proposals since it does not hold the CS lock, but we could reject them in the commit phase, couldn't we?

> Capacity Scheduler should properly handle container allocation on app/node
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.1.0
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Blocker
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch
>
> Thanks [~gopalv] for reporting this issue.
> In async mode, the capacity scheduler can allocate/reserve containers on a node/app
> while the node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This can cause issues, for example:
> a. A container for app_1 is reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. The reserve-on-node operation finishes after app_1 is removed
> ({{doneApplicationAttempt}}).
> For all future runs, node_x is completely blocked by the invalid
> reservation. It keeps reporting "Trying to schedule for a finished app, please
> double check" for node_x.
> We need a fix to make sure this won't happen.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
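The commit-phase check suggested in the comment above could be sketched as follows. This is a minimal illustration only; the class and method names (Proposal, SchedulerState, tryCommit) are hypothetical, not the actual CapacityScheduler API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a commit phase that rejects proposals whose app or node has
// already been removed. The allocation thread may produce a proposal from
// stale state (it does not hold the scheduler lock); the commit phase,
// which runs under the scheduler lock, re-validates the proposal here.
public class CommitPhaseSketch {
  static class Proposal {
    final String appId;
    final String nodeId;
    Proposal(String appId, String nodeId) {
      this.appId = appId;
      this.nodeId = nodeId;
    }
  }

  static class SchedulerState {
    private final Set<String> liveApps = ConcurrentHashMap.newKeySet();
    private final Set<String> liveNodes = ConcurrentHashMap.newKeySet();

    void addApp(String appId) { liveApps.add(appId); }
    void removeApp(String appId) { liveApps.remove(appId); }
    void addNode(String nodeId) { liveNodes.add(nodeId); }

    // Reject any proposal that references a removed app or node.
    boolean tryCommit(Proposal p) {
      if (!liveApps.contains(p.appId) || !liveNodes.contains(p.nodeId)) {
        return false; // app or node is gone: drop the stale proposal
      }
      return true; // safe to proceed with the reservation/allocation
    }
  }

  public static void main(String[] args) {
    SchedulerState state = new SchedulerState();
    state.addApp("app_1");
    state.addNode("node_x");
    Proposal reserve = new Proposal("app_1", "node_x");
    System.out.println(state.tryCommit(reserve)); // true: both still valid
    state.removeApp("app_1"); // doneApplicationAttempt runs concurrently
    System.out.println(state.tryCommit(reserve)); // false: rejected
  }
}
```

With this shape, a reserve proposal produced just before app_1 is removed is dropped at commit time instead of leaving a permanent invalid reservation on node_x.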
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524472#comment-16524472 ] Wangda Tan commented on YARN-8459:
Thanks [~sunilg],
Addressed #1. For #2, it is required since we need to revert the changes made in the previous commonReserve.
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524276#comment-16524276 ] Sunil Govindan commented on YARN-8459:
Thanks [~leftnoteasy]. This makes sense to me. Some comments:
# In the FiCaSchedulerApp#accept call, {{if (isStopping)}} is missing a return statement.
# If node.reserveResource fails, is it really needed to call internalUnReserve?
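The first review comment above can be illustrated with a minimal sketch of why the missing {{return}} matters. The names below (AcceptGuardSketch, acceptWithoutReturn, acceptWithReturn) are hypothetical, not the real FiCaSchedulerApp code.

```java
// Sketch: a stopping check that does not return lets the method fall
// through and accept the proposal anyway. Names are illustrative only.
public class AcceptGuardSketch {
  private volatile boolean isStopping = false;

  public void stop() { isStopping = true; }

  // Buggy shape: the guard is evaluated but nothing stops the fall-through.
  public boolean acceptWithoutReturn() {
    if (isStopping) {
      // missing "return false;" -- execution continues below
    }
    return true; // proposal accepted even while stopping
  }

  // Fixed shape: reject the proposal as soon as the app is stopping.
  public boolean acceptWithReturn() {
    if (isStopping) {
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    AcceptGuardSketch app = new AcceptGuardSketch();
    app.stop();
    System.out.println(app.acceptWithoutReturn()); // true: the bug
    System.out.println(app.acceptWithReturn());    // false: the fix
  }
}
```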
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524057#comment-16524057 ] Wangda Tan commented on YARN-8459:
[~cheersyang],
According to the current locking design of CapacityScheduler:
1) Add/remove node/app requires the CS lock.
2) Allocate/release container acquires only the app/node/queue lock, for better performance.
The simplest solution is to put allocate/release container under the CS lock, but that would cause a performance regression. Adding a stopping flag to app/node seems like the cleanest solution in my mind; please share if you have any better idea.
[~sunilg],
My intention is to set the stopping flag under the app/node lock instead of using volatile. We don't want one thread allocating a container on a node while another thread is trying to remove that node.
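The stopping-flag approach described in the comment above could be sketched as follows. This is a minimal sketch under assumptions: the class and method names (StoppableNode, markStopping, tryReserve) are hypothetical, not the actual FiCaSchedulerNode API. The key point is that the flag is flipped and read under the same per-node lock, so a removal cannot interleave with an in-flight reservation.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: a node that refuses new reservations once it is being removed.
// The flag is set and checked under the node's write lock rather than
// being a bare volatile, so "mark stopping" and "reserve" are serialized.
public class StoppableNode {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private boolean stopping = false;
  private int reservedContainers = 0;

  // Would be called from the removal path (e.g. removeNode), which in the
  // real scheduler also holds the CS lock.
  public void markStopping() {
    lock.writeLock().lock();
    try {
      stopping = true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Would be called from the async allocation thread, which does NOT hold
  // the CS lock; only the per-node lock protects this path.
  public boolean tryReserve() {
    lock.writeLock().lock();
    try {
      if (stopping) {
        return false; // node is being removed: refuse the reservation
      }
      reservedContainers++;
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  public int getReservedContainers() {
    lock.readLock().lock();
    try {
      return reservedContainers;
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    StoppableNode node = new StoppableNode();
    System.out.println(node.tryReserve());            // true: node live
    node.markStopping();                              // removal starts
    System.out.println(node.tryReserve());            // false: refused
    System.out.println(node.getReservedContainers()); // 1
  }
}
```

Because both paths take the write lock, a reservation either completes before the node is marked stopping or observes the flag and aborts; there is no window where a reservation lands on an already-removed node.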
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524036#comment-16524036 ] Sunil Govindan commented on YARN-8459:
Hi [~leftnoteasy] and [~cheersyang],
This issue happens because doneApplicationAttempt runs under the CS lock, but containers can be reserved without that lock. It's a race condition; however, we can handle it with this new flag, since the flag is used within the context of the app/node. One comment here: could this isStopping flag be volatile?
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523262#comment-16523262 ] Weiwei Yang commented on YARN-8459:
Hi [~leftnoteasy],
Both {{CapacityScheduler#removeNode}} and {{CapacityScheduler#doneApplicationAttempt}} remove all reservations of the node/app. Do you mean this issue happens because the allocation thread is reading stale info about the app/node? Instead of adding a stop flag, can we check whether the node/app is still valid while doing the reserve/allocate?
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523075#comment-16523075 ] genericqa commented on YARN-8459:
-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 29s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 24m 15s | trunk passed |
| +1 | compile | 0m 37s | trunk passed |
| +1 | checkstyle | 0m 11s | trunk passed |
| +1 | mvnsite | 0m 40s | trunk passed |
| +1 | shadedclient | 10m 23s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 0s | trunk passed |
| +1 | javadoc | 0m 26s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 39s | the patch passed |
| +1 | compile | 0m 36s | the patch passed |
| +1 | javac | 0m 36s | the patch passed |
| +1 | checkstyle | 0m 10s | the patch passed |
| +1 | mvnsite | 0m 39s | the patch passed |
| +1 | whitespace | 0m 1s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 12s | the patch passed |
| +1 | javadoc | 0m 22s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 72m 28s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 125m 48s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8459 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929112/YARN-8459.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 02b0c0fa91df 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7a3c6e9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/21102/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21102/testReport/ |
| Max. process+thread count | 935 (vs. ulimit of 1) |
| modules | C:
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522972#comment-16522972 ] Wangda Tan commented on YARN-8459:
Attached ver.1 patch to run Jenkins. I feel it might not be straightforward to add tests; we would need a lot of mocking. I'm thinking of adding a chaos-monkey-like UT that just randomly starts/stops nodes/apps. We should be able to get some interesting results from that. Will update the ver.2 patch with tests.
cc: [~sunil.gov...@gmail.com], [~Tao Yang], [~cheersyang].