[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416342#comment-17416342 ] Hadoop QA commented on YARN-10935:

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 8m 39s | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| branch-3.2 Compile Tests ||
| +1 | mvninstall | 26m 7s | branch-3.2 passed |
| +1 | compile | 0m 44s | branch-3.2 passed |
| +1 | checkstyle | 0m 38s | branch-3.2 passed |
| +1 | mvnsite | 0m 48s | branch-3.2 passed |
| +1 | shadedclient | 13m 32s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 38s | branch-3.2 passed |
| 0 | spotbugs | 15m 47s | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 38s | branch-3.2 passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 46s | the patch passed |
| +1 | compile | 0m 37s | the patch passed |
| +1 | javac | 0m 37s | the patch passed |
| +1 | checkstyle | 0m 32s | the patch passed |
| +1 | mvnsite | 0m 41s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 18s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 30s | the patch passed |
| +1 | spotbugs | 1m 42s | the patch passed |
|| Other Tests ||
| -1 | unit | 74m 46s | hadoop-yarn-server-resourcemanager in the patch failed. (logfile: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1209/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |
| +1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
| | | 146m 17s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore |

|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.41 ServerAPI=1.41 base:
[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416314#comment-17416314 ] Eric Payne commented on YARN-10935: --- OK, attached branch-3.2 patch. Thanks [~ebadger].

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, capacityscheduler
> Reporter: Eric Payne
> Assignee: Eric Payne
> Priority: Major
> Fix For: 3.4.0, 3.3.2
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, YARN-10935.branch-3.2.003.patch
>
> This happens when DRF is enabled and all of one resource is consumed but the second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max capacity are the same, with 2 or more sub-queues whose max capacity is 100%. In one of the sub-queues, start a long-running app that consumes all resources in the parent queue's hierarchy. This app will consume all of the memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per User}}* limit is much more than the *{{Max Application Master Resources}}* limit.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
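To make the inversion concrete, here is a toy illustration (not the actual CapacityScheduler code; the formulas below are simplified assumptions): if the queue-level AM limit is derived from what the parent hierarchy still has free, while the per-user AM limit is derived from the queue's configured capacity, then exhausting one resource (memory) drives the former below the latter.

```java
// Toy model of the YARN-10935 symptom. Both limit formulas are
// hypothetical simplifications, chosen only to show how a usage-based
// queue limit can drop below a capacity-based per-user limit.
public class AmLimitDemo {
    // Hypothetical queue-level AM limit: a percentage of the memory the
    // parent hierarchy still has available (the dominant resource here).
    static int queueAmLimitGB(int freeMemGB, double maxAmPercent) {
        return (int) (freeMemGB * maxAmPercent);
    }

    // Hypothetical per-user AM limit: a percentage of the queue's
    // configured capacity, independent of current cluster usage.
    static int perUserAmLimitGB(int configuredCapGB, double maxAmPercent) {
        return (int) (configuredCapGB * maxAmPercent);
    }

    public static void main(String[] args) {
        int freeMemGB = 0;        // the long-running app consumed all memory
        int configuredCapGB = 50; // the second queue's configured capacity
        double maxAmPercent = 0.1;

        int queueLimit = queueAmLimitGB(freeMemGB, maxAmPercent);
        int userLimit = perUserAmLimitGB(configuredCapGB, maxAmPercent);
        // The inversion the issue describes: per-user limit > queue limit.
        System.out.println("queue AM limit = " + queueLimit
            + " GB, per-user AM limit = " + userLimit + " GB");
    }
}
```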
[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-10935: -- Attachment: YARN-10935.branch-3.2.003.patch
[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416308#comment-17416308 ] Hadoop QA commented on YARN-10935:

(/) +1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 9m 35s | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| branch-2.10 Compile Tests ||
| +1 | mvninstall | 14m 38s | branch-2.10 passed |
| +1 | compile | 0m 49s | branch-2.10 passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | compile | 0m 41s | branch-2.10 passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| +1 | checkstyle | 0m 31s | branch-2.10 passed |
| +1 | mvnsite | 0m 48s | branch-2.10 passed |
| +1 | javadoc | 0m 37s | branch-2.10 passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | javadoc | 0m 28s | branch-2.10 passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| 0 | spotbugs | 4m 0s | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 27s | branch-2.10 passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 40s | the patch passed |
| +1 | compile | 0m 44s | the patch passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | javac | 0m 44s | the patch passed |
| +1 | compile | 0m 37s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| +1 | javac | 0m 37s | the patch passed |
| +1 | checkstyle | 0m 25s | the patch passed |
| +1 | mvnsite | 0m 41s | the patch passed |
| +1 | whitespace | 0m 1s | The patch has no whitespace issues. |
| +1 | javadoc | 0m 29s | the patch passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | javadoc | 0m 25s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| +1 | spotbugs | 1m 34s | the patch passed |
|| Other Tests ||
| +1 | unit | 64m 17s |
[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416276#comment-17416276 ] Eric Payne commented on YARN-10935: --- Attaching the branch-2.10 patch. Will look into the branch-3.2 patch.
[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-10935: -- Attachment: YARN-10935.branch-2.10.003.patch
[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10935: --- Fix Version/s: 3.3.2, 3.4.0

[~epayne], looks like it's clean back to branch-3.3. So I committed it to trunk (3.4) and branch-3.3. But I'll need patches for branch-3.2 onwards if you'd like it backported.
[jira] [Created] (YARN-10958) Use correct configuration for Group service init in CSMappingPlacementRule
Peter Bacsko created YARN-10958:
---
Summary: Use correct configuration for Group service init in CSMappingPlacementRule
Key: YARN-10958
URL: https://issues.apache.org/jira/browse/YARN-10958
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko

There is a potential problem in {{CSMappingPlacementRule.java}}:
{noformat}
if (groups == null) {
  groups = Groups.getUserToGroupsMappingService(conf);
}
{noformat}
The problem is, we're supposed to pass {{scheduler.getConf()}}. The "conf" object is the config for the capacity scheduler, which does not include the property that selects the group service provider. Therefore, the current code only works by chance, because the Group mapping service is already initialized at this point. See the original fix in YARN-10053.

We also need a unit test to verify it. Idea:
# Create a Configuration object in which the property "hadoop.security.group.mapping" refers to an existing test implementation.
# Add a new method to {{Groups}} which nulls out the singleton instance, e.g. {{Groups.reset()}}.
# Create a mock CapacityScheduler where {{getConf()}} and {{getConfiguration()}} contain different settings for "hadoop.security.group.mapping". Since {{getConf()}} is the service config, this should return the config object created in step #1.
# Create an instance of {{CSMappingPlacementRule}} with a single primary group rule.
# Run the placement evaluation.
# Expected: the returned queue matches what is supposed to come from the test group mapping service ("testuser" --> "testqueue").
# Modify "hadoop.security.group.mapping" in the config object created in step #1.
# Call {{Groups.refresh()}}, which changes the group mapping ("testuser" --> "testqueue2"). This requires that the test group mapping service implement {{GroupMappingServiceProvider.cacheGroupsRefresh()}}.
# Create a new instance of {{CSMappingPlacementRule}}.
# Run the placement evaluation again.
# Expected: with the same user, the target queue has changed.
This looks convoluted, but these steps make sure that:
# {{CSMappingPlacementRule}} will force the initialization of groups.
# We select the correct configuration for group service init.
# We don't create a new {{Groups}} instance if the singleton is initialized, so we cover the original problem described in YARN-10597.
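The "works by chance" behavior above can be modeled with a small self-contained toy (these are not Hadoop's real classes; the singleton shape mirrors {{Groups.getUserToGroupsMappingService}} and the proposed {{Groups.reset()}} hook, everything else is an assumption):

```java
import java.util.Properties;

// Toy model of why passing the wrong config compiles and "works": the
// singleton is built from whichever config arrives first, and every
// later caller gets that instance regardless of the config it passes.
public class GroupsSingletonDemo {
    static class Groups {
        private static Groups instance;
        final String provider;

        private Groups(Properties conf) {
            this.provider = conf.getProperty(
                "hadoop.security.group.mapping", "none");
        }

        static Groups getUserToGroupsMappingService(Properties conf) {
            if (instance == null) {
                instance = new Groups(conf);
            }
            return instance; // conf is ignored once initialized
        }

        // The hook the unit-test plan proposes, to null out the singleton.
        static void reset() { instance = null; }
    }

    public static void main(String[] args) {
        Properties serviceConf = new Properties(); // like scheduler.getConf()
        serviceConf.setProperty("hadoop.security.group.mapping", "testProvider");
        Properties csConf = new Properties(); // capacity-scheduler config,
                                              // lacks the property entirely

        // Service startup initializes the singleton with the right config...
        Groups.getUserToGroupsMappingService(serviceConf);
        // ...so the buggy call with csConf still resolves "testProvider".
        System.out.println(Groups.getUserToGroupsMappingService(csConf).provider);

        // With a fresh singleton, the bug becomes visible: no provider set.
        Groups.reset();
        System.out.println(Groups.getUserToGroupsMappingService(csConf).provider);
    }
}
```

This is exactly why the test plan needs {{Groups.reset()}}: without it, the singleton initialized elsewhere masks the wrong-config call.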
[jira] [Assigned] (YARN-10936) Fix typo in LogAggregationFileController
[ https://issues.apache.org/jira/browse/YARN-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tibor Kovács reassigned YARN-10936: --- Assignee: Tibor Kovács (was: Tamas Domok)

> Fix typo in LogAggregationFileController
>
> Key: YARN-10936
> URL: https://issues.apache.org/jira/browse/YARN-10936
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: Tamas Domok
> Assignee: Tibor Kovács
> Priority: Trivial
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code:java}
> -LOG.warn("Failed to check if FileSystem suppports permissions on "
> +LOG.warn("Failed to check if FileSystem supports permissions on "{code}
[jira] [Updated] (YARN-10936) Fix typo in LogAggregationFileController
[ https://issues.apache.org/jira/browse/YARN-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-10936: -- Labels: pull-request-available (was: )
[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM
[ https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-10955:

Description:
RM is the most complex component in YARN, with many basic or core services including RPC servers, event dispatchers, an HTTP server, the core scheduler, state managers, etc., and some of them depend on other basic components like ZooKeeper and HDFS.

Currently we may have to dig for suspicious traces among many related metrics and tremendous logs when encountering an unclear issue, hoping to locate the root cause of the problem. For example, some applications keep staying in NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam in an event dispatcher; the useful traces are sunk in many metrics and logs. That's not easy to figure out even for experts, let alone common users.

So I propose to add a common health check mechanism to improve troubleshooting for RM. In general, we can:
* add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), workState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and keyMetrics(Map).
* make some key services implement the HealthReporter interface and generate a health report by evaluating their internal state.
* add a HealthCheckerService which can manage and monitor all reportable services, supporting checking and fetching health reports periodically and manually (can be triggered by REST API), publishing metrics and logs as well.

was: (same description, with processState(enum: NORMAL/IDLE/BUSY) instead of workState)
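A minimal sketch of the proposal above (the {{HealthReporter}} interface and the {{HealthReport}} fields follow the description; the example service and thresholds are purely hypothetical, not RM code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the proposed health-check mechanism. Only the
// interface shape and report fields come from the proposal; the
// dispatcher example and its 10,000-event threshold are assumptions.
public class HealthCheckDemo {
    enum WorkState { NORMAL, IDLE, BUSY }

    static class HealthReport {
        final boolean healthy;
        final WorkState workState;
        final long updateTime;
        final String diagnostics;
        final Map<String, Long> keyMetrics;

        HealthReport(boolean healthy, WorkState workState,
                     String diagnostics, Map<String, Long> keyMetrics) {
            this.healthy = healthy;
            this.workState = workState;
            this.updateTime = System.currentTimeMillis();
            this.diagnostics = diagnostics;
            this.keyMetrics = keyMetrics;
        }
    }

    interface HealthReporter {
        HealthReport getHealthReport();
    }

    // A hypothetical reportable service: an event dispatcher that
    // reports BUSY/unhealthy when its internal queue backs up.
    static class EventDispatcher implements HealthReporter {
        int queuedEvents;

        public HealthReport getHealthReport() {
            Map<String, Long> metrics = new LinkedHashMap<>();
            metrics.put("queuedEvents", (long) queuedEvents);
            boolean healthy = queuedEvents < 10_000;
            return new HealthReport(healthy,
                healthy ? WorkState.NORMAL : WorkState.BUSY,
                healthy ? "OK" : "event queue is jammed", metrics);
        }
    }

    // A HealthCheckerService would poll registered reporters
    // periodically; here we aggregate a single round.
    static boolean checkAll(HealthReporter... reporters) {
        boolean allHealthy = true;
        for (HealthReporter r : reporters) {
            HealthReport rep = r.getHealthReport();
            System.out.println(rep.workState + ": " + rep.diagnostics);
            allHealthy &= rep.healthy;
        }
        return allHealthy;
    }

    public static void main(String[] args) {
        EventDispatcher dispatcher = new EventDispatcher();
        dispatcher.queuedEvents = 42;
        System.out.println("overall healthy: " + checkAll(dispatcher));
    }
}
```

The point of the aggregation step is that one place (and one REST endpoint) can answer "is RM healthy, and if not, which service is the bottleneck", instead of forcing users to correlate metrics and logs by hand.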
[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM
[ https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-10955:

Description:
RM is the most complex component in YARN, with many basic or core services including RPC servers, event dispatchers, an HTTP server, the core scheduler, state managers, etc., and some of them depend on other basic components like ZooKeeper and HDFS.

Currently we may have to dig for suspicious traces among many related metrics and tremendous logs when encountering an unclear issue, hoping to locate the root cause of the problem. For example, some applications keep staying in NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam in an event dispatcher; the useful traces are sunk in many metrics and logs. That's not easy to figure out even for experts, let alone common users.

So I propose to add a common health check mechanism to improve troubleshooting for RM. In general, we can:
* add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and keyMetrics(Map).
* make some key services implement the HealthReporter interface and generate a health report by evaluating their internal state.
* add a HealthCheckerService which can manage and monitor all reportable services, supporting checking and fetching health reports periodically and manually (can be triggered by REST API), publishing metrics and logs as well.

was: (same description, without the processState field)