[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544988#comment-14544988 ] Rohith commented on YARN-3646: -- Setting RetryPolicies.RETRY_FOREVER for exceptionToPolicyMap as default policy is not sufficient, but also {{RetryPolicies.RetryForever.shouldRetry()}} should check for Connect exceptions and handle it. Otherwise shouldRetry always return RetryAction.RETRY action. Applications are getting stuck some times in case of retry policy forever - Key: YARN-3646 URL: https://issues.apache.org/jira/browse/YARN-3646 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Raju Bairishetti We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER retry policy. Yarn client is infinitely retrying in case of exceptions from the RM as it is using retrying policy as FOREVER. The problem is it is retrying for all kinds of exceptions (like ApplicationNotFoundException), even though it is not a connection failure. Due to this my application is not progressing further. *Yarn client should not retry infinitely in case of non connection failures.* We have written a simple yarn-client which is trying to get an application report for an invalid or older appId. ResourceManager is throwing an ApplicationNotFoundException as this is an invalid or older appId. But because of retry policy FOREVER, client is keep on retrying for getting the application report and ResourceManager is throwing ApplicationNotFoundException continuously. {code} private void testYarnClientRetryPolicy() throws Exception{ YarnConfiguration conf = new YarnConfiguration(); conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1); YarnClient yarnClient = YarnClient.createYarnClient(); yarnClient.init(conf); yarnClient.start(); ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645); ApplicationReport report = yarnClient.getApplicationReport(appId); } {code} *RM logs:* {noformat} 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM. 
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
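Rohith's point above - that a forever policy must distinguish connection failures from application-level errors inside {{shouldRetry()}} - can be sketched as a small custom policy. This is only an illustration, assuming Hadoop's {{org.apache.hadoop.io.retry.RetryPolicy}} interface (2.x signature); it is not the YARN-3646 patch, and the class name is made up.
{code}
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

import org.apache.hadoop.io.retry.RetryPolicy;

/**
 * Illustration only: retry forever, but only for connection-level failures.
 * Anything else (e.g. ApplicationNotFoundException) fails immediately instead
 * of looping, which is the behaviour this JIRA asks for.
 */
public class RetryForeverOnConnectionFailure implements RetryPolicy {

  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    return isConnectionFailure(e) ? RetryAction.RETRY : RetryAction.FAIL;
  }

  private static boolean isConnectionFailure(Throwable t) {
    // Walk the cause chain: the IPC layer usually wraps the socket-level exception.
    while (t != null) {
      if (t instanceof ConnectException
          || t instanceof NoRouteToHostException
          || t instanceof UnknownHostException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }
}
{code}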
[jira] [Updated] (YARN-3653) Expose the scheduler's KPI in web UI
[ https://issues.apache.org/jira/browse/YARN-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated YARN-3653: -- Component/s: webapp Expose the scheduler's KPI in web UI Key: YARN-3653 URL: https://issues.apache.org/jira/browse/YARN-3653 Project: Hadoop YARN Issue Type: Improvement Components: webapp Reporter: Xianyin Xin As discussed in YARN-3630, exposing the scheduler's KPI in web UI is very useful for administrator to track the scheduler's performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3583) Support of NodeLabel object instead of plain String in YarnClient side.
[ https://issues.apache.org/jira/browse/YARN-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545101#comment-14545101 ] Hadoop QA commented on YARN-3583: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 8s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 55s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 3m 1s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 5m 36s | The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | mapreduce tests | 108m 59s | Tests passed in hadoop-mapreduce-client-jobclient. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 7m 0s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 2m 1s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 50m 7s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 212m 38s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | | | Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS; locked 66% of time Unsynchronized access at FileSystemRMStateStore.java:66% of time Unsynchronized access at FileSystemRMStateStore.java:[line 156] | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733060/0002-YARN-3583.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cbc01ed | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/7949/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-mapreduce-client-jobclient test log | https://builds.apache.org/job/PreCommit-YARN-Build/7949/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7949/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/7949/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7949/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7949/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7949/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7949/console | This message was automatically generated. Support of NodeLabel object instead of plain String in YarnClient side. --- Key: YARN-3583 URL: https://issues.apache.org/jira/browse/YARN-3583 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3583.patch, 0002-YARN-3583.patch Similar to YARN-3521, use NodeLabel objects in YarnClient side apis. getLabelsToNodes/getNodeToLabels api's can use NodeLabel object instead of using plain label name. This will help to bring other label details such as Exclusivity to client side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
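For readers unfamiliar with the FindBugs warning quoted above ("Inconsistent synchronization ... locked 66% of time"), it fires when a field is written while holding a lock but read without one. Below is a minimal, made-up illustration of the pattern and the usual remedies; it is not the actual {{FileSystemRMStateStore}} code.
{code}
// Illustration only; class and field names are made up for this sketch.
class StoreFlags {

  // FindBugs "inconsistent synchronization": the field is written under a lock
  // but read without one, so only some accesses are synchronized.
  private boolean onHdfs;

  synchronized void init(boolean isHdfs) {
    this.onHdfs = isHdfs;       // locked write
  }

  boolean isOnHdfs() {
    return onHdfs;              // unlocked read -> the warning fires
  }

  // Usual remedies: declare the field volatile, or synchronize every access.
  private volatile boolean onHdfsFixed;

  void initFixed(boolean isHdfs) { this.onHdfsFixed = isHdfs; }

  boolean isOnHdfsFixed() { return onHdfsFixed; }
}
{code}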
[jira] [Created] (YARN-3651) Tracking url in ApplicaitonCLI wrong for running application
Bibin A Chundatt created YARN-3651: -- Summary: Tracking url in ApplicaitonCLI wrong for running application Key: YARN-3651 URL: https://issues.apache.org/jira/browse/YARN-3651 Project: Hadoop YARN Issue Type: Bug Components: applications, resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Application URL in Application CLI wrong Steps to reproduce == 1. Start HA setup 2.Submit application to cluster 3.Execute command ./yarn application -list 4.Observer tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* http://IP:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3630) YARN should suggest a heartbeat interval for applications
[ https://issues.apache.org/jira/browse/YARN-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545188#comment-14545188 ] Xianyin Xin commented on YARN-3630: --- Thanks, [~leftnoteasy] and [~kasha]. [~leftnoteasy], I think {{SchedulerMetrics}} is a good idea, but we should also consider a few things: # {{SchedulerMetrics}} would track various indicators of the scheduler, but we also have {{QueueMetrics}}; while not exactly the same thing, {{root.metrics}} already gives us most of the scheduler's information, so how do we deal with the relation between the two? # The number of events waiting to be handled is an important indicator of the scheduler's load, but it is not owned by the scheduler; it is maintained by {{ResourceManager#SchedulerEventDispatcher}}. So who should maintain the {{SchedulerMetrics}}, the {{SchedulerEventDispatcher}} or the scheduler itself? Going by the name, {{SchedulerMetrics}} should be maintained by the scheduler. Anyway, considering the WebUI improvement you mentioned, a {{SchedulerMetrics}} is needed. Created another jira, YARN-3652, to discuss this. Thanks [~kasha] for the valuable suggestions on the policy for determining the heartbeat interval; in the following days I'll work on a draft. YARN should suggest a heartbeat interval for applications - Key: YARN-3630 URL: https://issues.apache.org/jira/browse/YARN-3630 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.7.0 Reporter: Zoltán Zvara Assignee: Xianyin Xin Priority: Minor It seems currently applications - for example Spark - are not adaptive to RM regarding heartbeat intervals. RM should be able to suggest a desired heartbeat interval to applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
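A {{SchedulerMetrics}} source of the kind discussed here could sit on the same metrics2 machinery that {{QueueMetrics}} already uses. The sketch below is purely illustrative (the class, fields and metric names are assumptions, not an agreed design), and it deliberately leaves open who updates it - the scheduler or the {{SchedulerEventDispatcher}}.
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;
import org.apache.hadoop.metrics2.lib.MutableRate;

/** Illustration only: the indicators mentioned in this discussion. */
@Metrics(context = "yarn")
public class SchedulerMetrics {

  @Metric("Scheduler events waiting to be handled")
  MutableGaugeInt pendingSchedulerEvents;

  @Metric("Time taken to handle one scheduler event")
  MutableRate eventHandlingTime;   // rate + average latency ~ throughput and delay

  public static SchedulerMetrics create() {
    return DefaultMetricsSystem.instance().register(
        "SchedulerMetrics", "Metrics for the YARN scheduler", new SchedulerMetrics());
  }

  public void setPendingEvents(int queued) {
    pendingSchedulerEvents.set(queued);
  }

  public void eventHandled(long elapsedMillis) {
    eventHandlingTime.add(elapsedMillis);
  }
}
{code}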
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545097#comment-14545097 ] Akira AJISAKA commented on YARN-3069: - Thanks [~rchiang] for the update. Looked the patch from {{yarn.nodemanager.aux-services.mapreduce_shuffle.class}} to {{yarn.client.app-submission.poll-interval}}. bq. {code} !-- Minicluster Configuration -- {code} I'm thinking it would be better for users to document that the configuration is only used for testing. bq. yarn.minicluster.yarn.nodemanager.resource.memory-mb Default value is 4096. bq. yarn.node-labels.fs-store.retry-policy-spec Retry policy used for FileSystem node label store. The policy is specified by N pairs of sleep-time in milliseconds and number-of-retries s1,n1,s2,n2, Default value is 2000, 500. (I'm thinking the default number of retries is too high.) bq. {code} description URI for NodeLabelManager /description {code} Would you document that default is in local: {{/tmp/hadoop-yarn-$\{user\}/node-labels/}} in the description? It is described in {{FileSystemNodeLabelsStore#getDefaultFSNodeLabelsRootDir}}. bq. yarn.node-labels.configuration-type Set configuration type for node labels. Administrators can specify centralized or distributed. bq. yarn.client.app-submission.poll-interval Can we move this parameter to DeprecatedProperties.md? Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. 
org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
security.applicationhistory.protocol.acl
yarn.app.container.log.backups
yarn.app.container.log.dir
yarn.app.container.log.filesize
yarn.client.app-submission.poll-interval
yarn.client.application-client-protocol.poll-timeout-ms
yarn.is.minicluster
yarn.log.server.url
yarn.minicluster.control-resource-monitoring
yarn.minicluster.fixed.ports
yarn.minicluster.use-rpc
yarn.node-labels.fs-store.retry-policy-spec
yarn.node-labels.fs-store.root-dir
yarn.node-labels.manager-class
yarn.nodemanager.container-executor.os.sched.priority.adjustment
yarn.nodemanager.container-monitor.process-tree.class
yarn.nodemanager.disk-health-checker.enable
yarn.nodemanager.docker-container-executor.image-name
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
yarn.nodemanager.linux-container-executor.group
yarn.nodemanager.log.deletion-threads-count
yarn.nodemanager.user-home-dir
yarn.nodemanager.webapp.https.address
yarn.nodemanager.webapp.spnego-keytab-file
yarn.nodemanager.webapp.spnego-principal
yarn.nodemanager.windows-secure-container-executor.group
yarn.resourcemanager.configuration.file-system-based-store
yarn.resourcemanager.delegation-token-renewer.thread-count
yarn.resourcemanager.delegation.key.update-interval
yarn.resourcemanager.delegation.token.max-lifetime
yarn.resourcemanager.delegation.token.renew-interval
yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
yarn.resourcemanager.metrics.runtime.buckets
yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
yarn.resourcemanager.reservation-system.class
yarn.resourcemanager.reservation-system.enable
yarn.resourcemanager.reservation-system.plan.follower
yarn.resourcemanager.reservation-system.planfollower.time-step
yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
yarn.resourcemanager.webapp.spnego-keytab-file
yarn.resourcemanager.webapp.spnego-principal
yarn.scheduler.include-port-in-node-name
yarn.timeline-service.delegation.key.update-interval
yarn.timeline-service.delegation.token.max-lifetime
yarn.timeline-service.delegation.token.renew-interval
yarn.timeline-service.generic-application-history.enabled
yarn.timeline-service.generic-application-history.fs-history-store.compression-type
yarn.timeline-service.generic-application-history.fs-history-store.uri
[jira] [Updated] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3651: --- Summary: Tracking url in ApplicationCLI wrong for running application (was: Tracking url in ApplicaitonCLI wrong for running application) Tracking url in ApplicationCLI wrong for running application Key: YARN-3651 URL: https://issues.apache.org/jira/browse/YARN-3651 Project: Hadoop YARN Issue Type: Bug Components: applications, resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Application URL in Application CLI wrong Steps to reproduce == 1. Start HA setup 2.Submit application to cluster 3.Execute command ./yarn application -list 4.Observer tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* http://IP:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3547) FairScheduler: Apps that have no resource demand should not participate scheduling
[ https://issues.apache.org/jira/browse/YARN-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated YARN-3547: -- Attachment: YARN-3547.005.patch A patch using {{getDemand() - getResourceUsage()}}. FairScheduler: Apps that have no resource demand should not participate scheduling -- Key: YARN-3547 URL: https://issues.apache.org/jira/browse/YARN-3547 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Xianyin Xin Assignee: Xianyin Xin Attachments: YARN-3547.001.patch, YARN-3547.002.patch, YARN-3547.003.patch, YARN-3547.004.patch, YARN-3547.005.patch At present, all of the 'running' apps participate the scheduling process, however, most of them may have no resource demand on a production cluster, as the app's status is running other than waiting for resource at the most of the app's lifetime. It's not a wise way we sort all the 'running' apps and try to fulfill them, especially on a large-scale cluster which has heavy scheduling load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
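The check implied by the patch name ({{getDemand() - getResourceUsage()}}) boils down to something like the sketch below: only apps with outstanding demand are offered to the sorting/assignment path. This is a rough illustration, not the attached 005 patch; the helper name is made up and the {{Schedulable}} accessors are assumed.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

/** Illustration only: skip apps whose demand is already satisfied. */
final class DemandFilter {

  private DemandFilter() {}

  // Roughly "getDemand() - getResourceUsage() > 0" from the patch name.
  static boolean hasPendingDemand(Resource demand, Resource usage) {
    Resource pending = Resources.subtract(demand, usage);
    return pending.getMemory() > 0 || pending.getVirtualCores() > 0;
  }
}
{code}
A scheduler loop would then call {{hasPendingDemand(app.getDemand(), app.getResourceUsage())}} before sorting or offering a node to the app, so apps with no outstanding ask never enter the comparison at all.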
[jira] [Commented] (YARN-3630) YARN should suggest a heartbeat interval for applications
[ https://issues.apache.org/jira/browse/YARN-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545218#comment-14545218 ] Sunil G commented on YARN-3630: --- bq.Are we considering automatically slowing down the NM heartbeats as well? YARN should suggest a heartbeat interval for applications - Key: YARN-3630 URL: https://issues.apache.org/jira/browse/YARN-3630 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.7.0 Reporter: Zoltán Zvara Assignee: Xianyin Xin Priority: Minor It seems currently applications - for example Spark - are not adaptive to RM regarding heartbeat intervals. RM should be able to suggest a desired heartbeat interval to applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3583) Support of NodeLabel object instead of plain String in YarnClient side.
[ https://issues.apache.org/jira/browse/YARN-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545131#comment-14545131 ] Sunil G commented on YARN-3583: --- bq.Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS; locked 66% of time This doesn't look related to this patch. Support of NodeLabel object instead of plain String in YarnClient side. --- Key: YARN-3583 URL: https://issues.apache.org/jira/browse/YARN-3583 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3583.patch, 0002-YARN-3583.patch Similar to YARN-3521, use NodeLabel objects in YarnClient side apis. getLabelsToNodes/getNodeToLabels api's can use NodeLabel object instead of using plain label name. This will help to bring other label details such as Exclusivity to client side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3547) FairScheduler: Apps that have no resource demand should not participate scheduling
[ https://issues.apache.org/jira/browse/YARN-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544995#comment-14544995 ] Xianyin Xin commented on YARN-3547: --- [~kasha], can you please have a look? FairScheduler: Apps that have no resource demand should not participate scheduling -- Key: YARN-3547 URL: https://issues.apache.org/jira/browse/YARN-3547 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Xianyin Xin Assignee: Xianyin Xin Attachments: YARN-3547.001.patch, YARN-3547.002.patch, YARN-3547.003.patch, YARN-3547.004.patch, YARN-3547.005.patch At present, all of the 'running' apps participate the scheduling process, however, most of them may have no resource demand on a production cluster, as the app's status is running other than waiting for resource at the most of the app's lifetime. It's not a wise way we sort all the 'running' apps and try to fulfill them, especially on a large-scale cluster which has heavy scheduling load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3652) A SchedulerMetrics may be need for evaluating the scheduler's performance
[ https://issues.apache.org/jira/browse/YARN-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated YARN-3652: -- Summary: A SchedulerMetrics may be need for evaluating the scheduler's performance (was: A {{SchedulerMetrics}} may be need for evaluating the scheduler's performance) A SchedulerMetrics may be need for evaluating the scheduler's performance - Key: YARN-3652 URL: https://issues.apache.org/jira/browse/YARN-3652 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Reporter: Xianyin Xin As discussed in YARN-3630, a {{SchedulerMetrics}} may be need for evaluating the scheduler's performance. The performance indexes includes #events waiting for being handled by scheduler, the throughput, the scheduling delay and/or other indicators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3652) A {{SchedulerMetrics}} may be need for evaluating the scheduler's performance
Xianyin Xin created YARN-3652: - Summary: A {{SchedulerMetrics}} may be need for evaluating the scheduler's performance Key: YARN-3652 URL: https://issues.apache.org/jira/browse/YARN-3652 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Reporter: Xianyin Xin As discussed in YARN-3630, a {{SchedulerMetrics}} may be need for evaluating the scheduler's performance. The performance indexes includes #events waiting for being handled by scheduler, the throughput, the scheduling delay and/or other indicators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3630) YARN should suggest a heartbeat interval for applications
[ https://issues.apache.org/jira/browse/YARN-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545219#comment-14545219 ] Sunil G commented on YARN-3630: --- bq.Are we considering automatically slowing down the NM heartbeats as well? YARN should suggest a heartbeat interval for applications - Key: YARN-3630 URL: https://issues.apache.org/jira/browse/YARN-3630 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.7.0 Reporter: Zoltán Zvara Assignee: Xianyin Xin Priority: Minor It seems currently applications - for example Spark - are not adaptive to RM regarding heartbeat intervals. RM should be able to suggest a desired heartbeat interval to applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3653) Expose the scheduler's KPI in web UI
Xianyin Xin created YARN-3653: - Summary: Expose the scheduler's KPI in web UI Key: YARN-3653 URL: https://issues.apache.org/jira/browse/YARN-3653 Project: Hadoop YARN Issue Type: Improvement Reporter: Xianyin Xin As discussed in YARN-3630, exposing the scheduler's KPI in web UI is very useful for administrator to track the scheduler's performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3651: --- Description: Application URL in Application CLI wrong Steps to reproduce == 1. Start HA setup insecure mode 2.Configure HTTPS_ONLY 3.Submit application to cluster 4.Execute command ./yarn application -list 5.Observer tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* https://IP:64323/proxy/application_1431672734347_0003 / was: Application URL in Application CLI wrong Steps to reproduce == 1. Start HA setup 2.Submit application to cluster 3.Execute command ./yarn application -list 4.Observer tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* http://IP:64323/proxy/application_1431672734347_0003 / Priority: Minor (was: Major) Tracking url in ApplicationCLI wrong for running application Key: YARN-3651 URL: https://issues.apache.org/jira/browse/YARN-3651 Project: Hadoop YARN Issue Type: Bug Components: applications, resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Priority: Minor Application URL in Application CLI wrong Steps to reproduce == 1. Start HA setup insecure mode 2.Configure HTTPS_ONLY 3.Submit application to cluster 4.Execute command ./yarn application -list 5.Observer tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* https://IP:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3630) YARN should suggest a heartbeat interval for applications
[ https://issues.apache.org/jira/browse/YARN-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545225#comment-14545225 ] Sunil G commented on YARN-3630: --- bq.Are we considering automatically slowing down the NM heartbeats as well? +1. It would be wise to slow down the heartbeats from NMs which carry less load. However, there should be a limit or range to which they can be slowed down even under lighter load; otherwise I feel more starvation can happen for applications. YARN should suggest a heartbeat interval for applications - Key: YARN-3630 URL: https://issues.apache.org/jira/browse/YARN-3630 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.7.0 Reporter: Zoltán Zvara Assignee: Xianyin Xin Priority: Minor It seems currently applications - for example Spark - are not adaptive to RM regarding heartbeat intervals. RM should be able to suggest a desired heartbeat interval to applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545264#comment-14545264 ] Devaraj K commented on YARN-3646: - You can probably avoid this situation by setting a bigger value for yarn.resourcemanager.connect.max-wait.ms (like below) if you want to wait for a long time to establish a connection to the RM with retries. {code} conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, Integer.MAX_VALUE); {code} Anyway, it seems this issue needs to be fixed. Applications are getting stuck some times in case of retry policy forever - Key: YARN-3646 URL: https://issues.apache.org/jira/browse/YARN-3646 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Raju Bairishetti We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER retry policy. Yarn client is infinitely retrying in case of exceptions from the RM as it is using retrying policy as FOREVER. The problem is it is retrying for all kinds of exceptions (like ApplicationNotFoundException), even though it is not a connection failure. Due to this my application is not progressing further. *Yarn client should not retry infinitely in case of non connection failures.* We have written a simple yarn-client which is trying to get an application report for an invalid or older appId. ResourceManager is throwing an ApplicationNotFoundException as this is an invalid or older appId. But because of retry policy FOREVER, client is keep on retrying for getting the application report and ResourceManager is throwing ApplicationNotFoundException continuously. {code} private void testYarnClientRetryPolicy() throws Exception{ YarnConfiguration conf = new YarnConfiguration(); conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1); YarnClient yarnClient = YarnClient.createYarnClient(); yarnClient.init(conf); yarnClient.start(); ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645); ApplicationReport report = yarnClient.getApplicationReport(appId); } {code} *RM logs:* {noformat} 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM. 
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3651) Tracking url in ApplicationCLI wrong for running application
[ https://issues.apache.org/jira/browse/YARN-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545353#comment-14545353 ] Bibin A Chundatt commented on YARN-3651: Also, when configured as HTTPS_ONLY, why is the HTTP port opened? Hi all, any comments on the same? Tracking url in ApplicationCLI wrong for running application Key: YARN-3651 URL: https://issues.apache.org/jira/browse/YARN-3651 Project: Hadoop YARN Issue Type: Bug Components: applications, resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Priority: Minor Application URL in Application CLI wrong Steps to reproduce == 1. Start HA setup insecure mode 2.Configure HTTPS_ONLY 3.Submit application to cluster 4.Execute command ./yarn application -list 5.Observer tracking URL shown {code} 15/05/15 13:34:38 INFO client.AHSProxy: Connecting to Application History server at /IP:45034 Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id --- Tracking-URL application_1431672734347_0003 *http://host-10-19-92-117:13013* {code} *Expected* https://IP:64323/proxy/application_1431672734347_0003 / -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3652) A SchedulerMetrics may be need for evaluating the scheduler's performance
[ https://issues.apache.org/jira/browse/YARN-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545260#comment-14545260 ] Sunil G commented on YARN-3652: --- Hi [~xinxianyin], this will be a very helpful feature; thanks for working on it. A few points: 1. *Throughput*: Are you referring to the number of events processed over a period of time? If so, how can we set the window over which throughput is calculated (configurable?)? A clear benefit is that we could predict a likely completion time for the pending events in the dispatcher queue. Combining throughput with the number of pending events may give a much better indication of RM overload. 2. Since many events come to the scheduler, a filter on event type, if possible, may be helpful to make the throughput and scheduling-delay numbers more accurate. A SchedulerMetrics may be need for evaluating the scheduler's performance - Key: YARN-3652 URL: https://issues.apache.org/jira/browse/YARN-3652 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Reporter: Xianyin Xin As discussed in YARN-3630, a {{SchedulerMetrics}} may be need for evaluating the scheduler's performance. The performance indexes includes #events waiting for being handled by scheduler, the throughput, the scheduling delay and/or other indicators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545392#comment-14545392 ] Hudson commented on YARN-1519: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #928 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/928/]) YARN-1519. Check in container-executor if sysconf is implemented before using it (Radim Kolar and Eric Payne via raviprak) (raviprak: rev 53fe4eff09fdaeed75a8cad3a26156bf963a8d37) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to segfault because invalid pointer gets passed to libc function. fix: enforce minimum value 1024, same method is used in hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545410#comment-14545410 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #197 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/197/]) YARN-3505. Node's Log Aggregation Report with SUCCEED should not cached in RMApps. Contributed by Xuan Gong. (junping_du: rev 15ccd967ee3e7046a50522089f67ba01f36ec76a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStatusEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task 
Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545415#comment-14545415 ] Hudson commented on YARN-1519: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #197 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/197/]) YARN-1519. Check in container-executor if sysconf is implemented before using it (Radim Kolar and Eric Payne via raviprak) (raviprak: rev 53fe4eff09fdaeed75a8cad3a26156bf963a8d37) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to segfault because invalid pointer gets passed to libc function. fix: enforce minimum value 1024, same method is used in hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545250#comment-14545250 ] Lavkesh Lahngir commented on YARN-3591: --- What about zombie files lying in the various paths? In the case of the disk becoming good again, they will be there forever. Do we not care? Also, I was thinking of removing resources which have public- and user-level visibility, because app-level resources will be deleted automatically. Thoughts? Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch It happens when a resource is localised on a disk and, after localising, that disk has gone bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
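The proposal in the description - force an {{open()}} on the parent directory instead of trusting {{file.exists()}} - can be illustrated as below. A sketch only, not the attached patch; the class and method names are made up.
{code}
import java.io.File;

/** Illustration only: a presence check that forces a directory open(). */
final class LocalResourceCheck {

  private LocalResourceCheck() {}

  static boolean isResourceReallyPresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    // file.exists() can answer from cached inode data on a failed disk;
    // File.list() has to open the directory and returns null if it cannot.
    String[] entries = parent.list();
    return entries != null && entries.length >= 1 && localizedPath.exists();
  }
}
{code}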
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545387#comment-14545387 ] Hudson commented on YARN-3505: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #928 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/928/]) YARN-3505. Node's Log Aggregation Report with SUCCEED should not cached in RMApps. Contributed by Xuan Gong. (junping_du: rev 15ccd967ee3e7046a50522089f67ba01f36ec76a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStatusEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: 
log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3605) _ as method name may not be supported much longer
[ https://issues.apache.org/jira/browse/YARN-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated YARN-3605: -- Labels: newbie (was: ) _ as method name may not be supported much longer - Key: YARN-3605 URL: https://issues.apache.org/jira/browse/YARN-3605 Project: Hadoop YARN Issue Type: Bug Reporter: Robert Joseph Evans Labels: newbie I was trying to run the precommit test on my mac under JDK8, and I got the following error related to javadocs. (use of '_' as an identifier might not be supported in releases after Java SE 8) It looks like we need to at least change the method name to not be '_' any more, or possibly replace the HTML generation with something more standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
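For context, the warning refers to a language-level restriction: {{_}} as an identifier produces a warning on Java 8 and is rejected from Java 9 onwards. A generic illustration (not the actual Hamlet code) of the offending shape and an obvious rename:
{code}
// Java 8: warning "'_' used as an identifier ... might not be supported in
// releases after Java SE 8". Java 9+: compile error.
public class Underscore {

  public Underscore _() {     // offending method name
    return this;
  }

  // One possible remediation: give the method a real name (callers must be updated).
  public Underscore end() {
    return this;
  }
}
{code}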
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545471#comment-14545471 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-trunk-Commit #7841 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7841/]) YARN-3505 addendum: fix an issue in previous patch. (junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-624) Support gang scheduling in the AM RM protocol
[ https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545486#comment-14545486 ] john lilley commented on YARN-624: -- I would like to +1 this feature, and illustrate our use cases. Currently there are two: -- Finding strongly-connected subgraphs. This is a central step in data-quality/matching applications, because after record-matching is performed in a distributed fashion, the match pairs (edges) must be turned into match groups (subgraphs). It is very inefficient to process this using a traditional independent-task YARN model. -- Machine-learning model training. There are many models that lend themselves to distributed processing, and even those that don't can benefit from a parallel genetic algorithm that competes multiple models and topologies in parallel. In both these cases we are considering a custom AM that works like this: -- Asks for M containers. -- Accepts as few as N containers, but only after not getting M for some period of time (heuristics TBD). -- Possibly, after getting a non-zero number of containers but fewer than N for some time, releases them all, sleeps a while, and tries again (deadlock avoidance). This algorithm would be much better run by the RM, because it can: -- Immediately fail the AM if N containers are impossible. -- Avoid idle incomplete sets of containers while waiting for a sufficient gang. -- Avoid deadlock. Support gang scheduling in the AM RM protocol - Key: YARN-624 URL: https://issues.apache.org/jira/browse/YARN-624 Project: Hadoop YARN Issue Type: Sub-task Components: api, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Per discussion on YARN-392 and elsewhere, gang scheduling, in which a scheduler runs a set of tasks when they can all be run at the same time, would be a useful feature for YARN schedulers to support. Currently, AMs can approximate this by holding on to containers until they get all the ones they need. However, this lends itself to deadlocks when different AMs are waiting on the same containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
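The hold-until-N-containers workaround described in this comment would look roughly like the method below inside a custom AM. This is a sketch assuming the {{AMRMClient}} API; the parameter names and back-off policy are placeholders, and the container requests are assumed to have been added already via {{addContainerRequest()}}.
{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Illustration only: ask for M, proceed with at least N, otherwise give everything
// back so a partial gang is never held (the deadlock-avoidance step above).
List<Container> waitForGang(AMRMClient<AMRMClient.ContainerRequest> amRMClient,
    int desiredM, int minimumN, long waitMillis, long heartbeatMillis)
    throws Exception {
  List<Container> gang = new ArrayList<Container>();
  long deadline = System.currentTimeMillis() + waitMillis;
  while (gang.size() < desiredM && System.currentTimeMillis() < deadline) {
    AllocateResponse response = amRMClient.allocate(0.0f);  // heartbeat + collect grants
    gang.addAll(response.getAllocatedContainers());
    Thread.sleep(heartbeatMillis);
  }
  if (gang.size() < minimumN) {
    for (Container c : gang) {
      amRMClient.releaseAssignedContainer(c.getId());       // don't hold a partial gang
    }
    gang.clear();                                           // caller sleeps and retries
  }
  return gang;
}
{code}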
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545548#comment-14545548 ] Srikanth Sundarrajan commented on YARN-3646: {quote} You can probably avoid this situation by setting a bigger value {quote} Would this not cause the client to wait for too long (well after the rm has come back online) Applications are getting stuck some times in case of retry policy forever - Key: YARN-3646 URL: https://issues.apache.org/jira/browse/YARN-3646 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Raju Bairishetti We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER retry policy. Yarn client is infinitely retrying in case of exceptions from the RM as it is using retrying policy as FOREVER. The problem is it is retrying for all kinds of exceptions (like ApplicationNotFoundException), even though it is not a connection failure. Due to this my application is not progressing further. *Yarn client should not retry infinitely in case of non connection failures.* We have written a simple yarn-client which is trying to get an application report for an invalid or older appId. ResourceManager is throwing an ApplicationNotFoundException as this is an invalid or older appId. But because of retry policy FOREVER, client is keep on retrying for getting the application report and ResourceManager is throwing ApplicationNotFoundException continuously. {code} private void testYarnClientRetryPolicy() throws Exception{ YarnConfiguration conf = new YarnConfiguration(); conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1); YarnClient yarnClient = YarnClient.createYarnClient(); yarnClient.init(conf); yarnClient.start(); ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645); ApplicationReport report = yarnClient.getApplicationReport(appId); } {code} *RM logs:* {noformat} 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM. at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545710#comment-14545710 ] Hudson commented on YARN-3505: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #196 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/196/]) YARN-3505. Node's Log Aggregation Report with SUCCEED should not cached in RMApps. Contributed by Xuan Gong. (junping_du: rev 15ccd967ee3e7046a50522089f67ba01f36ec76a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStatusEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto YARN-3505 addendum: fix an issue in previous patch. 
(junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545715#comment-14545715 ] Hudson commented on YARN-1519: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #196 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/196/]) YARN-1519. Check in container-executor if sysconf is implemented before using it (Radim Kolar and Eric Payne via raviprak) (raviprak: rev 53fe4eff09fdaeed75a8cad3a26156bf963a8d37) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to segfault because invalid pointer gets passed to libc function. fix: enforce minimum value 1024, same method is used in hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545702#comment-14545702 ] Devaraj K commented on YARN-3646: - bq. Would this not cause the client to wait for too long (well after the rm has come back online) yarn.resourcemanager.connect.max-wait.ms is the max time to wait to establish a connection to RM, If the RM comes online before this time it will connect immediately. IPC client would be internally retrying to connect RM for every yarn.resourcemanager.connect.retry-interval.ms (default value 30 * 1000) and exception will be thrown if it can't connect for yarn.resourcemanager.connect.max-wait.ms. Applications are getting stuck some times in case of retry policy forever - Key: YARN-3646 URL: https://issues.apache.org/jira/browse/YARN-3646 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Raju Bairishetti We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER retry policy. Yarn client is infinitely retrying in case of exceptions from the RM as it is using retrying policy as FOREVER. The problem is it is retrying for all kinds of exceptions (like ApplicationNotFoundException), even though it is not a connection failure. Due to this my application is not progressing further. *Yarn client should not retry infinitely in case of non connection failures.* We have written a simple yarn-client which is trying to get an application report for an invalid or older appId. ResourceManager is throwing an ApplicationNotFoundException as this is an invalid or older appId. But because of retry policy FOREVER, client is keep on retrying for getting the application report and ResourceManager is throwing ApplicationNotFoundException continuously. {code} private void testYarnClientRetryPolicy() throws Exception{ YarnConfiguration conf = new YarnConfiguration(); conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1); YarnClient yarnClient = YarnClient.createYarnClient(); yarnClient.init(conf); yarnClient.start(); ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645); ApplicationReport report = yarnClient.getApplicationReport(appId); } {code} *RM logs:* {noformat} 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM. 
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
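A minimal client-side sketch of the bounded-retry settings Devaraj describes above, assuming a plain YarnClient; the two property constants are existing YarnConfiguration keys, but the values shown are illustrative only, not recommendations:
{code}
// Illustrative only: bound RM connection retries instead of retrying forever.
YarnConfiguration conf = new YarnConfiguration();
// Stop trying to establish a connection to the RM after 15 minutes...
conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 15 * 60 * 1000);
// ...retrying the connect attempt every 30 seconds in between.
conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS, 30 * 1000);

YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(conf);
yarnClient.start();
{code}
With these settings the IPC layer keeps retrying the connection at the configured interval and gives up with an exception once the max-wait budget is spent, so the client neither hangs forever nor waits past the point where the RM is back.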
[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545732#comment-14545732 ] Hudson commented on YARN-1519: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2144 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2144/]) YARN-1519. Check in container-executor if sysconf is implemented before using it (Radim Kolar and Eric Payne via raviprak) (raviprak: rev 53fe4eff09fdaeed75a8cad3a26156bf963a8d37) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/CHANGES.txt check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to segfault because invalid pointer gets passed to libc function. fix: enforce minimum value 1024, same method is used in hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545727#comment-14545727 ] Hudson commented on YARN-3505: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2144 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2144/]) YARN-3505. Node's Log Aggregation Report with SUCCEED should not cached in RMApps. Contributed by Xuan Gong. (junping_du: rev 15ccd967ee3e7046a50522089f67ba01f36ec76a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStatusEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java YARN-3505 addendum: fix an issue in previous patch. 
(junping_du: rev 03a293aed6de101b0cae1a294f506903addcaa75) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all node's log aggregation reports in RMApps for always, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2421: - Summary: RM still allocates containers to an app in the FINISHING state (was: CapacityScheduler still allocates containers to an app in the FINISHING state) RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
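A hedged sketch of the kind of guard this issue asks for, not the committed patch: before granting allocations, check whether the attempt has already reached a finishing or terminal state. The method name and exact hook point are assumptions for illustration; the real change lives in ApplicationMasterService.
{code}
// Illustrative only: refuse new allocations once the attempt is unregistering or done.
private boolean canAllocate(RMAppAttempt appAttempt) {
  RMAppAttemptState state = appAttempt.getAppAttemptState();
  return state != RMAppAttemptState.FINISHING
      && state != RMAppAttemptState.FINISHED
      && state != RMAppAttemptState.KILLED
      && state != RMAppAttemptState.FAILED;
}
{code}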
[jira] [Commented] (YARN-3609) Move load labels from storage from serviceInit to serviceStart to make it works with RM HA case.
[ https://issues.apache.org/jira/browse/YARN-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546262#comment-14546262 ] Hadoop QA commented on YARN-3609: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 58s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 45s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 38s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 2m 44s | The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 50m 25s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 91m 45s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | | | Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS; locked 66% of time Unsynchronized access at FileSystemRMStateStore.java:66% of time Unsynchronized access at FileSystemRMStateStore.java:[line 156] | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733214/YARN-3609.3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 03a293a | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7951/artifact/patchprocess/whitespace.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/7951/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7951/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7951/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7951/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7951/console | This message was automatically generated. Move load labels from storage from serviceInit to serviceStart to make it works with RM HA case. 
Key: YARN-3609 URL: https://issues.apache.org/jira/browse/YARN-3609 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3609.1.preliminary.patch, YARN-3609.2.patch, YARN-3609.3.patch Now RMNodeLabelsManager loads label when serviceInit, but RMActiveService.start() is called when RM HA transition happens. We haven't done this before because queue's initialization happens in serviceInit as well, we need make sure labels added to system before init queue, after YARN-2918, we should be able to do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
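A rough sketch of the restructuring this JIRA proposes, assuming the label manager is an AbstractService; the class name and helper below are placeholders, not the actual RMNodeLabelsManager code:
{code}
// Sketch only: defer loading labels from the store until the service starts, so an
// RM HA transition to active (which starts the active services) re-reads the store.
public class LabelStoreLoadingService extends org.apache.hadoop.service.AbstractService {
  public LabelStoreLoadingService() { super("LabelStoreLoadingService"); }

  @Override
  protected void serviceInit(org.apache.hadoop.conf.Configuration conf) throws Exception {
    // keep cheap setup here: parse config, construct the store object
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    recoverLabelsFromStore();   // hypothetical helper replaying labels from the state store
    super.serviceStart();
  }

  private void recoverLabelsFromStore() { /* load labels into memory here */ }
}
{code}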
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546374#comment-14546374 ] Craig Welch commented on YARN-3626: --- Right, going back to [~cnauroth], [~vinodkv], we chatted and you asserted that the original approach can't work, but it seemed to, it's not entirely clear to me why it shouldn't... On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3659) Federation Router (hiding multiple RMs for ApplicationClientProtocol)
[ https://issues.apache.org/jira/browse/YARN-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-3659: - Description: This JIRA tracks the design/implementation of the layer for routing ApplicaitonClientProtocol requests to the appropriate RM(s) in a federated YARN cluster. was: This JIRA tracks the design/implementation of the layer for routing ApplicationClientProtocol requests to the appropriate RM(s) in a federated YARN cluster. Federation Router (hiding multiple RMs for ApplicationClientProtocol) - Key: YARN-3659 URL: https://issues.apache.org/jira/browse/YARN-3659 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Giovanni Matteo Fumarola This JIRA tracks the design/implementation of the layer for routing ApplicaitonClientProtocol requests to the appropriate RM(s) in a federated YARN cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3661) Federation UI
Giovanni Matteo Fumarola created YARN-3661: -- Summary: Federation UI Key: YARN-3661 URL: https://issues.apache.org/jira/browse/YARN-3661 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Giovanni Matteo Fumarola The UIs provided by each RM, provide a correct local view of what is running in a sub-cluster. In the context of federation we need new UIs that can track load, jobs, users across sub-clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546454#comment-14546454 ] Gour Saha commented on YARN-3561: - [~jianhe] it is consistently reproducible on debian 7. Can you provide a quick instruction on how to enable debug level in NM logs? [~chackra] If possible, can you set debug level on for NM logs, re-run the test and provide the logs again? Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Attachments: app0001.zip Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO
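On the question above about enabling debug level in the NM logs: one common approach, assuming the stock log4j.properties setup shipped with Hadoop, is to add a logger override for the NodeManager package and restart the NM. This is a generic log4j hint, not project guidance from this JIRA:
{noformat}
# Example only: raise NodeManager logging to DEBUG in etc/hadoop/log4j.properties
log4j.logger.org.apache.hadoop.yarn.server.nodemanager=DEBUG
{noformat}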
[jira] [Assigned] (YARN-3659) Federation Router (hiding multiple RMs for ApplicationClientProtocol)
[ https://issues.apache.org/jira/browse/YARN-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola reassigned YARN-3659: -- Assignee: Giovanni Matteo Fumarola Federation Router (hiding multiple RMs for ApplicationClientProtocol) - Key: YARN-3659 URL: https://issues.apache.org/jira/browse/YARN-3659 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Giovanni Matteo Fumarola Assignee: Giovanni Matteo Fumarola This JIRA tracks the design/implementation of the layer for routing ApplicaitonClientProtocol requests to the appropriate RM(s) in a federated YARN cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3654) ContainerLogsPage web UI should not have meta-refresh
[ https://issues.apache.org/jira/browse/YARN-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546217#comment-14546217 ] Hadoop QA commented on YARN-3654: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 14s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 40s | The applied patch generated 2 new checkstyle issues (total was 12, now 13). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 7s | The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 6m 4s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 25s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-nodemanager | | | Class org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebAppFilter defines non-transient non-serializable instance field nmConf In NMWebAppFilter.java:instance field nmConf In NMWebAppFilter.java | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733248/YARN-3654.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 03a293a | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7953/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/7953/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7953/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7953/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7953/console | This message was automatically generated. ContainerLogsPage web UI should not have meta-refresh - Key: YARN-3654 URL: https://issues.apache.org/jira/browse/YARN-3654 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.1 Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3654.1.patch Currently, When we try to find the container logs for the finished application, it will re-direct to the url which we re-configured for yarn.log.server.url in yarn-site.xml. 
But in ContainerLogsPage, we are using meta-refresh: {code} set(TITLE, join("Redirecting to log server for ", $(CONTAINER_ID))); html.meta_http("refresh", "1; url=" + redirectUrl); {code} which does not work well for browsers that require meta-refresh to be enabled in their security settings, especially IE, where meta-refresh is considered a security hole. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
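A hedged alternative sketch of the idea being discussed: issue the redirect server-side from a servlet filter instead of emitting a meta-refresh page. This only illustrates the approach; it is not the attached patch, and the class and init-parameter names are made up.
{code}
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

// Illustrative filter: send an HTTP 302 instead of relying on meta-refresh in the browser.
public class LogServerRedirectFilter implements Filter {
  private String logServerUrl;   // assumed to be resolved from yarn.log.server.url

  @Override
  public void init(FilterConfig config) {
    logServerUrl = config.getInitParameter("log.server.url");
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    if (logServerUrl != null) {
      ((HttpServletResponse) resp).sendRedirect(logServerUrl);
    } else {
      chain.doFilter(req, resp);
    }
  }

  @Override
  public void destroy() { }
}
{code}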
[jira] [Commented] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546254#comment-14546254 ] Jason Lowe commented on YARN-2421: -- +1 latest patch lgtm. Committing this. CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546257#comment-14546257 ] Hadoop QA commented on YARN-3632: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 35s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:red}-1{color} | javac | 7m 32s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 45s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 18s | The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 50m 1s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 86m 18s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | | | Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS; locked 66% of time Unsynchronized access at FileSystemRMStateStore.java:66% of time Unsynchronized access at FileSystemRMStateStore.java:[line 156] | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733235/YARN-3632.4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 03a293a | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/7952/artifact/patchprocess/diffJavacWarnings.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7952/artifact/patchprocess/whitespace.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/7952/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7952/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7952/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7952/console | This message was automatically generated. 
Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
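A simplified sketch of why a comparator-backed ordering needs an explicit reorder hook when demand changes: the backing TreeSet must see the removal before the field the comparator reads is mutated. The class below is a placeholder for illustration, not the actual OrderingPolicy API.
{code}
import java.util.Comparator;
import java.util.TreeSet;

// Minimal illustration: re-ordering an entry when the key the comparator uses changes.
class DemandOrderedSet<T> {
  private final TreeSet<T> entities;

  DemandOrderedSet(Comparator<T> comparator) {
    this.entities = new TreeSet<>(comparator);
  }

  void add(T entity) { entities.add(entity); }

  // Remove, mutate, re-insert: mutating the sort key while the element is still in
  // the set would silently corrupt the ordering.
  void demandUpdated(T entity, Runnable applyDemandChange) {
    entities.remove(entity);
    applyDemandChange.run();
    entities.add(entity);
  }
}
{code}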
[jira] [Commented] (YARN-2421) RM still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546268#comment-14546268 ] Hudson commented on YARN-2421: -- FAILURE: Integrated in Hadoop-trunk-Commit #7842 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7842/]) YARN-2421. RM still allocates containers to an app in the FINISHING state. Contributed by Chang Li (jlowe: rev f7e051c4310024d4040ad466c34432c72e88b0fc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java RM still allocates containers to an app in the FINISHING state -- Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Thomas Graves Assignee: Chang Li Fix For: 2.8.0 Attachments: YARN-2421.4.patch, YARN-2421.5.patch, YARN-2421.6.patch, YARN-2421.7.patch, YARN-2421.8.patch, YARN-2421.9.patch, yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3632: -- Attachment: YARN-3632.5.patch Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546271#comment-14546271 ] Craig Welch commented on YARN-3632: --- One line change to address missing whitespace issue. Again, the javac and findbugs don't appear to have anything to do with the patch. Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546296#comment-14546296 ] Jian He commented on YARN-2268: --- I think the lock file solution only suits zk-store, not for other state-store implementations. The current approach of polling web service should be more general. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
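A rough illustration of the web-service polling approach mentioned above, assuming the standard RM REST endpoint /ws/v1/cluster/info; a format tool built on this check would refuse to proceed if any configured RM answers. This is a sketch of the idea, not the attached patch.
{code}
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: treat a responding RM web service as "an RM is running, do not format".
static boolean rmAppearsToBeRunning(String rmWebAddress) {
  try {
    URL url = new URL("http://" + rmWebAddress + "/ws/v1/cluster/info");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setConnectTimeout(2000);
    conn.setReadTimeout(2000);
    return conn.getResponseCode() == 200;
  } catch (Exception e) {
    return false;   // nothing answered; formatting may proceed
  }
}
{code}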
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546331#comment-14546331 ] Hadoop QA commented on YARN-3626: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 24s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 47s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 7s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 16s | The applied patch generated 2 new checkstyle issues (total was 59, now 58). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 33s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | mapreduce tests | 0m 45s | Tests passed in hadoop-mapreduce-client-common. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 6m 1s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 48m 49s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733257/YARN-3626.9.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / f7e051c | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7954/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-mapreduce-client-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7954/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7954/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7954/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7954/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7954/console | This message was automatically generated. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. 
On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
Ishai Menache created YARN-3656: --- Summary: LowCost: A Cost-Based Placement Agent for YARN Reservations Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Ishai Menache YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishai Menache updated YARN-3656: Attachment: LowCostRayonExternal.pdf This tech-report summarizes the details of LowCost, as well as our experimental results which show benefits in using LowCost on a variety of performance metrics LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Ishai Menache Attachments: LowCostRayonExternal.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546365#comment-14546365 ] Chris Nauroth commented on YARN-3626: - I don't fully understand the objection to the former patch that had been committed. bq. The new configuration added is supposed to be per app, but it is now a server side configuration. There was a new YARN configuration property for triggering this behavior, but the MR application would toggle on that YARN property only if the MR job submission had {{MAPREDUCE_JOB_USER_CLASSPATH_FIRST}} on. From {{MRApps}}: {code} boolean userClassesTakesPrecedence = conf.getBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, false); if (userClassesTakesPrecedence) { conf.set(YarnConfiguration.YARN_APPLICATION_CLASSPATH_PREPEND_DISTCACHE, true); } {code} I thought this implemented per app behavior, because it could vary between MR app submission instances. It would not be a requirement to put {{YARN_APPLICATION_CLASSPATH_PREPEND_DISTCACHE}} into the server configs and have the client and server share configs. Is there a detail I'm missing? On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
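For reference, a small job-side sketch of the per-app toggle being described, assuming an ordinary MapReduce Job object named job; only the property key itself is the one under discussion:
{code}
// Illustrative: ask for user classes first on a per-job basis.
Configuration jobConf = job.getConfiguration();
jobConf.setBoolean("mapreduce.job.user.classpath.first", true);
// MRApps then decides, per submission, whether localized resources are prepended.
{code}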
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546377#comment-14546377 ] Bikas Saha commented on YARN-1902: -- The AMRMClient was not written to automatically remove requests because it does not know which requests will be matched to allocated containers. The explicit contract is for users of AMRMClient to remove requests that have been matched to containers. If we change that behavior to automatically remove requests then it may lead to issues where 2 entities are removing requests. 1) user 2) AMRMClient. So that change should only be made in a different version of AMRMClient or else existing users will break. In the worst case, if the AMRMClient (automatically) removes the wrong request then the application will hang because the RM will not provide it the container that is needed. Not automatically removing the request has the downside of getting additional containers that need to be released by the application. We chose excess containers over hanging for the original implementation. Excess containers should happen rarely because the user controls when AMRMClient heartbeats to the RM and can do that after having removed all matched requests, so that the remote request table reflects the current state of outstanding requests. There may still be a race condition on the RM side that gives more containers. Excess containers can happen more often with AMRMClientAsync, because it heartbeats at a regular schedule and can send more requests than really outstanding if the heartbeat goes out before the user has removed the matched requests. Allocation of too many containers when a second request is done with the same resource capability - Key: YARN-1902 URL: https://issues.apache.org/jira/browse/YARN-1902 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Sietse T. Au Assignee: Sietse T. Au Labels: client Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) are requested in both scenarios, but that only in the second scenario, the correct behavior is observed. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of MapResource, ResourceRequestInfo is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
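A hedged sketch of the contract Bikas describes: the application tracks its own outstanding ContainerRequests and removes the ones it matches to allocated containers before the next heartbeat, so the remote request table reflects what is still outstanding. The matching strategy here (a simple FIFO queue) is an assumption for illustration only.
{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Illustration only: remove one matched request per allocated container.
class RequestTracker {
  private final AMRMClient<ContainerRequest> amRMClient;
  private final Deque<ContainerRequest> outstanding = new ArrayDeque<>();

  RequestTracker(AMRMClient<ContainerRequest> amRMClient) {
    this.amRMClient = amRMClient;
  }

  void ask(ContainerRequest request) {
    amRMClient.addContainerRequest(request);
    outstanding.add(request);
  }

  void onContainersAllocated(List<Container> containers) {
    for (Container ignored : containers) {
      ContainerRequest matched = outstanding.poll();
      if (matched != null) {
        amRMClient.removeContainerRequest(matched);
      }
    }
  }
}
{code}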
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546421#comment-14546421 ] Bikas Saha commented on YARN-1902: -- Yes. And then the RM may give a container on H1 which is not useful for the app. If we again auto-decrement and release the container then we end up with 2 outstanding requests and the job will hang because it needs 3 containers. Allocation of too many containers when a second request is done with the same resource capability - Key: YARN-1902 URL: https://issues.apache.org/jira/browse/YARN-1902 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Sietse T. Au Assignee: Sietse T. Au Labels: client Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) are requested in both scenarios, but that only in the second scenario, the correct behavior is observed. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of MapResource, ResourceRequestInfo is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation
[ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546420#comment-14546420 ] zhihai xu commented on YARN-3655: - I uploaded a patch YARN-3655.000.patch for review. FairScheduler: potential livelock due to maxAMShare limitation and container reservation - Key: YARN-3655 URL: https://issues.apache.org/jira/browse/YARN-3655 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3655.000.patch FairScheduler: potential livelock due to maxAMShare limitation and container reservation. If a node is reserved by an application, all the other applications don't have any chance to assign a new container on this node, unless the application which reserves the node assigns a new container on this node or releases the reserved container on this node. The problem is that if an application tries to call assignReservedContainer and fails to get a new container due to the maxAMShare limitation, it will block all other applications from using the nodes it reserves. If all other running applications can't release their AM containers because they are blocked by these reserved containers, a livelock situation can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential livelock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() && !getUnmanagedAM()) { List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug("Skipping allocation because maxAMShare limit would" + " be exceeded"); } return Resources.none(); } } {code} To fix this issue, we can unreserve the node if we can't allocate the AM container on the node due to the max AM share limitation and the node is reserved by the application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
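A hedged sketch in the spirit of the proposal above; the unreserve call and the exact hook inside FSAppAttempt#assignContainer are assumptions for illustration, not the attached patch:
{code}
// Illustration: when the AM container cannot be placed because of maxAMShare,
// also drop any reservation this attempt holds on the node so other apps can use it.
List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
if (ask.isEmpty() || !getQueue().canRunAppAM(ask.get(0).getCapability())) {
  RMContainer reserved = node.getReservedContainer();
  if (reserved != null
      && reserved.getApplicationAttemptId().equals(getApplicationAttemptId())) {
    unreserve(reserved.getReservedPriority(), node);   // hypothetical unreserve call
  }
  return Resources.none();
}
{code}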
[jira] [Created] (YARN-3662) Federation StateStore APIs
Subru Krishnan created YARN-3662: Summary: Federation StateStore APIs Key: YARN-3662 URL: https://issues.apache.org/jira/browse/YARN-3662 Project: Hadoop YARN Issue Type: Sub-task Reporter: Subru Krishnan Assignee: Subru Krishnan The Federation State defines the additional state that needs to be maintained to loosely couple multiple individual sub-clusters into a single large federated cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3663) Federation State and Policy Store (DBMS implementation)
Giovanni Matteo Fumarola created YARN-3663: -- Summary: Federation State and Policy Store (DBMS implementation) Key: YARN-3663 URL: https://issues.apache.org/jira/browse/YARN-3663 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Giovanni Matteo Fumarola This JIRA tracks a SQL-based implementation of the Federation State and Policy Store, which implements YARN-3662 APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546461#comment-14546461 ] zhihai xu commented on YARN-3591: - [~vinodkv], yes, keeping the ownership of turning disks good or bad in one single place is a very good suggestion. So it is reasonable to keep all the disk checking in DirectoryCollection. Normally the CacheCleanup thread periodically sends a CACHE_CLEANUP event to clean up these localized files in LocalResourcesTrackerImpl. If we only remove the localized resources on the bad disk which can't be recovered, it will be OK. Here a bad disk is different from a full disk. I suppose all the files on the bad disk will be lost/deleted by the time it becomes good again. Keeping app-level resources sounds reasonable to me. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch It happens when a resource is localised on a disk and, after localising, that disk has gone bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
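The file.list()-based check proposed in the description can be sketched as follows. The class and method names are made up for illustration; this is not the attached patch.
{code}
import java.io.File;

public final class LocalResourceCheckSketch {

  /**
   * Sketch of the proposed isResourcePresent() logic: instead of relying on
   * file.exists(), which can return true from cached inode metadata on a bad
   * disk, list the parent directory. list() performs a native open() and
   * returns null (or an empty array) when the disk is actually unreadable.
   */
  public static boolean isResourcePresent(String localizedPath) {
    File resource = new File(localizedPath);
    File parent = resource.getParentFile();
    if (parent == null) {
      return false;
    }
    String[] children = parent.list();
    if (children == null || children.length < 1) {
      return false; // disk is bad or the directory has vanished
    }
    for (String child : children) {
      if (child.equals(resource.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}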
[jira] [Updated] (YARN-3666) Federation Intercepting and propagating AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kishore Chaliparambil updated YARN-3666: External issue ID: (was: YARN-2884) Federation Intercepting and propagating AM-RM communications Key: YARN-3666 URL: https://issues.apache.org/jira/browse/YARN-3666 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Kishore Chaliparambil In order to support transparent spanning of jobs across sub-clusters, all AM-RM communications are proxied (via YARN-2884). This JIRA tracks federation-specific mechanisms that decide how to split/broadcast requests to the RMs and merge answers to the AM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546213#comment-14546213 ] Vinod Kumar Vavilapalli commented on YARN-3591: --- Essentially keeping the ownership of turning disks good or bad in one single place. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch It happens when a resource is localised on a disk and, after localising, that disk has gone bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546288#comment-14546288 ] Craig Welch commented on YARN-3626: --- [~cnauroth], [~vvasudev] - This patch goes back to the original approach I passed by you offline - the fix itself is the same, but it uses the classpath instead of configuration to determine when the behavior should change. Your thoughts? On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546359#comment-14546359 ] Vinod Kumar Vavilapalli commented on YARN-3626: --- bq. We may have to depend on some sort of a named environment variable or something, assuming adding a new field in CLC is not desirable. Can't we do the above? We definitely cannot insert mapreduce incantations like job.jar into YARN. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546370#comment-14546370 ] Craig Welch commented on YARN-3626: --- bq. Can't we do the above? We definitely cannot insert mapreduce incantations like job.jar into YARN. That's why I took the config based approach, which apparently is invalid... but it also worked, which is quite confusing. I'm going to go back and validate our reasoning for believing it shouldn't. bq. Can't we do the above? We definitely cannot insert mapreduce incantations like job.jar into YARN. I suppose we can if it would work. It needs to be something which can be propagated from Oozie, which adds additional complexity. Ideally, we need something that MRApps can set based on the presence of the mapred configuration so that it propagates through. Do we have an example of this being done elsewhere? On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that, localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546393#comment-14546393 ] Vinod Kumar Vavilapalli commented on YARN-3626: --- bq. I thought this implemented per app behavior, because it could vary between MR app submission instances. It would not be a requirement to put YARN_APPLICATION_CLASSPATH_PREPEND_DISTCACHE into the server configs and have the client and server share configs. YARN doesn't have a notion of app-configs; it doesn't know about an app's config files etc. So the app cannot set a config property that it expects the server to respect. No idea how your original patch apparently worked. Maybe we are missing something. [~cwelch], what I was proposing was something along the lines of (a) the user sets the MR user-classpath-first config, (b) MR converts that into a special env for YARN, and (c) YARN looks at the env to figure out how to order the classpath. Overall, it is terrible that we are talking classpaths in YARN, but that's for another JIRA. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that, localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
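A rough sketch of the (a)/(b)/(c) flow proposed above, assuming a dedicated environment variable whose name is purely illustrative (not an established YARN constant), with the MR-side translation and the YARN-side ordering kept strictly separate:
{code}
import java.io.File;
import java.util.Map;

public final class ClasspathOrderSketch {

  // (b) MR client side: translate the MR config into an env entry in the
  // ContainerLaunchContext, so YARN only ever sees the env variable.
  // "CLASSPATH_PREPEND_DISTCACHE" is an illustrative name.
  public static void exportFlag(boolean userClasspathFirst,
      Map<String, String> env) {
    if (userClasspathFirst) {
      env.put("CLASSPATH_PREPEND_DISTCACHE", "true");
    }
  }

  // (c) YARN side (e.g. while building the classpath jar on Windows): order
  // localized entries before or after the system classpath based only on
  // the env, never on an application config file.
  public static String buildClasspath(Map<String, String> env,
      String localizedEntries, String systemClasspath) {
    String flag = env.get("CLASSPATH_PREPEND_DISTCACHE");
    boolean prepend = flag != null && Boolean.parseBoolean(flag);
    return prepend
        ? localizedEntries + File.pathSeparator + systemClasspath
        : systemClasspath + File.pathSeparator + localizedEntries;
  }
}
{code}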
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546400#comment-14546400 ] Chris Nauroth commented on YARN-3626: - I see now. Thanks for the clarification. In that case, I agree with the new proposal. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation
[ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3655: Attachment: YARN-3655.000.patch FairScheduler: potential livelock due to maxAMShare limitation and container reservation - Key: YARN-3655 URL: https://issues.apache.org/jira/browse/YARN-3655 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3655.000.patch FairScheduler: potential deadlock due to maxAMShare limitation and container reservation. If a node is reserved by an application, all the other applications don't have any chance to assign a new container on this node, unless the application which reserves the node assigns a new container on this node or releases the reserved container on this node. The problem is if an application tries to call assignReservedContainer and fail to get a new container due to maxAMShare limitation, it will block all other applications to use the nodes it reserves. If all other running applications can't release their AM containers due to being blocked by these reserved containers. A dead lock situation can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential dead lock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() !getUnmanagedAM()) { ListResourceRequest ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug(Skipping allocation because maxAMShare limit would + be exceeded); } return Resources.none(); } } {code} To fix this issue, we can unreserve the node if we can't allocate the AM container on the node due to Max AM share limitation and the node is reserved by the application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation
[ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3655: Description: FairScheduler: potential livelock due to maxAMShare limitation and container reservation. If a node is reserved by an application, all the other applications don't have any chance to assign a new container on this node, unless the application which reserves the node assigns a new container on this node or releases the reserved container on this node. The problem is if an application tries to call assignReservedContainer and fail to get a new container due to maxAMShare limitation, it will block all other applications to use the nodes it reserves. If all other running applications can't release their AM containers due to being blocked by these reserved containers. A livelock situation can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential livelock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() !getUnmanagedAM()) { ListResourceRequest ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug(Skipping allocation because maxAMShare limit would + be exceeded); } return Resources.none(); } } {code} To fix this issue, we can unreserve the node if we can't allocate the AM container on the node due to Max AM share limitation and the node is reserved by the application. was: FairScheduler: potential deadlock due to maxAMShare limitation and container reservation. If a node is reserved by an application, all the other applications don't have any chance to assign a new container on this node, unless the application which reserves the node assigns a new container on this node or releases the reserved container on this node. The problem is if an application tries to call assignReservedContainer and fail to get a new container due to maxAMShare limitation, it will block all other applications to use the nodes it reserves. If all other running applications can't release their AM containers due to being blocked by these reserved containers. A dead lock situation can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential dead lock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() !getUnmanagedAM()) { ListResourceRequest ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug(Skipping allocation because maxAMShare limit would + be exceeded); } return Resources.none(); } } {code} To fix this issue, we can unreserve the node if we can't allocate the AM container on the node due to Max AM share limitation and the node is reserved by the application. FairScheduler: potential livelock due to maxAMShare limitation and container reservation - Key: YARN-3655 URL: https://issues.apache.org/jira/browse/YARN-3655 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3655.000.patch FairScheduler: potential livelock due to maxAMShare limitation and container reservation. 
If a node is reserved by an application, all the other applications don't have any chance to assign a new container on this node, unless the application which reserves the node assigns a new container on this node or releases the reserved container on this node. The problem is if an application tries to call assignReservedContainer and fail to get a new container due to maxAMShare limitation, it will block all other applications to use the nodes it reserves. If all other running applications can't release their AM containers due to being blocked by these reserved containers. A livelock situation can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential livelock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() !getUnmanagedAM()) { ListResourceRequest ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug(Skipping allocation because maxAMShare limit would + be exceeded); } return Resources.none(); } } {code} To fix this issue,
[jira] [Updated] (YARN-3659) Federation Router (hiding multiple RMs for ApplicationClientProtocol)
[ https://issues.apache.org/jira/browse/YARN-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-3659: --- Description: This JIRA tracks the design/implementation of the layer for routing ApplicationClientProtocol requests to the appropriate RM(s) in a federated YARN cluster. was: This JIRA tracks the design/implementation of the layer for routing ApplicaitonSubmissionProtocol requests to the appropriate RM(s) in a federated YARN cluster. Federation Router (hiding multiple RMs for ApplicationClientProtocol) - Key: YARN-3659 URL: https://issues.apache.org/jira/browse/YARN-3659 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Giovanni Matteo Fumarola This JIRA tracks the design/implementation of the layer for routing ApplicationClientProtocol requests to the appropriate RM(s) in a federated YARN cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3660) Federation Global Policy Generator (load balancing)
Carlo Curino created YARN-3660: -- Summary: Federation Global Policy Generator (load balancing) Key: YARN-3660 URL: https://issues.apache.org/jira/browse/YARN-3660 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Subru Krishnan In a federated environment, local impairments of one sub-cluster might unfairly affect users/queues that are mapped to that sub-cluster. A centralized component (GPG) runs out-of-band and edits the policies governing how users/queues are allocated to sub-clusters. This allows us to enforce global invariants (by dynamically updating locally-enforced invariants). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3659) Federation Router (hiding multiple RMs for ApplicationSubmissionProtocol)
Giovanni Matteo Fumarola created YARN-3659: -- Summary: Federation Router (hiding multiple RMs for ApplicationSubmissionProtocol) Key: YARN-3659 URL: https://issues.apache.org/jira/browse/YARN-3659 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Giovanni Matteo Fumarola This JIRA tracks the design/implementation of the layer for routing ApplicaitonSubmissionProtocol requests to the appropriate RM(s) in a federated YARN cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3659) Federation Router (hiding multiple RMs for ApplicationClientProtocol)
[ https://issues.apache.org/jira/browse/YARN-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-3659: - Summary: Federation Router (hiding multiple RMs for ApplicationClientProtocol) (was: Federation Router (hiding multiple RMs for ApplicationSubmissionProtocol)) Federation Router (hiding multiple RMs for ApplicationClientProtocol) - Key: YARN-3659 URL: https://issues.apache.org/jira/browse/YARN-3659 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Giovanni Matteo Fumarola This JIRA tracks the design/implementation of the layer for routing ApplicaitonSubmissionProtocol requests to the appropriate RM(s) in a federated YARN cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3664) Federation PolicyStore APIs
[ https://issues.apache.org/jira/browse/YARN-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan reassigned YARN-3664: Assignee: Subru Krishnan Federation PolicyStore APIs --- Key: YARN-3664 URL: https://issues.apache.org/jira/browse/YARN-3664 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Subru Krishnan Assignee: Subru Krishnan The federation Policy Store contains information about the capacity allocations made by users, their mapping to sub-clusters and the policies that each of the components (Router, AMRMPRoxy, RMs) should enforce -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3665) Federation subcluster membership mechanisms
Subru Krishnan created YARN-3665: Summary: Federation subcluster membership mechanisms Key: YARN-3665 URL: https://issues.apache.org/jira/browse/YARN-3665 Project: Hadoop YARN Issue Type: Sub-task Reporter: Subru Krishnan The member YARN RMs continuously heartbeat to the state store to keep alive and publish their current capability/load information. This JIRA tracks these mechanisms. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3663) Federation State and Policy Store (DBMS implementation)
[ https://issues.apache.org/jira/browse/YARN-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola reassigned YARN-3663: -- Assignee: Giovanni Matteo Fumarola Federation State and Policy Store (DBMS implementation) --- Key: YARN-3663 URL: https://issues.apache.org/jira/browse/YARN-3663 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Giovanni Matteo Fumarola Assignee: Giovanni Matteo Fumarola This JIRA tracks a SQL-based implementation of the Federation State and Policy Store, which implements YARN-3662 APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3661) Federation UI
[ https://issues.apache.org/jira/browse/YARN-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola reassigned YARN-3661: -- Assignee: Giovanni Matteo Fumarola Federation UI -- Key: YARN-3661 URL: https://issues.apache.org/jira/browse/YARN-3661 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Giovanni Matteo Fumarola Assignee: Giovanni Matteo Fumarola The UIs provided by each RM, provide a correct local view of what is running in a sub-cluster. In the context of federation we need new UIs that can track load, jobs, users across sub-clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3666) Federation Intercepting and propagating AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kishore Chaliparambil reassigned YARN-3666: --- Assignee: Kishore Chaliparambil Federation Intercepting and propagating AM-RM communications Key: YARN-3666 URL: https://issues.apache.org/jira/browse/YARN-3666 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Kishore Chaliparambil Assignee: Kishore Chaliparambil In order to support transparent spanning of jobs across sub-clusters, all AM-RM communications are proxied (via YARN-2884). This JIRA tracks federation-specific mechanisms that decide how to split/broadcast requests to the RMs and merge answers to the AM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3561) Non-AM Containers continue to run even after AM is stopped
[ https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546278#comment-14546278 ] Jian He commented on YARN-3561: --- [~gsaha], from the description, this is running against 2.6 ? this could be related to YARN-2825, but that's fixed in 2.6 From the logs, I can only see the container is still sort of waiting for the process to finish. Is this easy to reproduce? It'll be great if we have NM logs with debug level on. Non-AM Containers continue to run even after AM is stopped -- Key: YARN-3561 URL: https://issues.apache.org/jira/browse/YARN-3561 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.6.0 Environment: debian 7 Reporter: Gour Saha Priority: Critical Attachments: app0001.zip Non-AM containers continue to run even after application is stopped. This occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a Hadoop 2.6 deployment. Following are the NM logs from 2 different nodes: *host-07* - where Slider AM was running *host-03* - where Storm NIMBUS container was running. *Note:* The logs are partial, starting with the time when the relevant Slider AM and NIMBUS containers were allocated, till the time when the Slider AM was stopped. Also, the large number of Memory usage log lines were removed keeping only a few starts and ends of every segment. *NM log from host-07 where Slider AM container was running:* {noformat} 2015-04-29 00:39:24,614 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for container_1428575950531_0020_02_01 2015-04-29 00:41:10,310 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE) 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for container_1428575950531_0021_01_01 by user yarn 2015-04-29 00:41:10,322 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new application reference for app application_1428575950531_0021 2015-04-29 00:41:10,323 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from NEW to INITING 2015-04-29 00:41:10,325 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=yarn IP=10.84.105.162 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1428575950531_0021 CONTAINERID=container_1428575950531_0021_01_01 2015-04-29 00:41:10,328 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users. 2015-04-29 00:41:10,328 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:init(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 
2015-04-29 00:41:10,351 INFO application.Application (ApplicationImpl.java:transition(304)) - Adding container_1428575950531_0021_01_01 to application application_1428575950531_0021 2015-04-29 00:41:10,352 INFO application.Application (ApplicationImpl.java:handle(464)) - Application application_1428575950531_0021 transitioned from INITING to RUNNING 2015-04-29 00:41:10,356 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING 2015-04-29 00:41:10,357 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId application_1428575950531_0021 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,357 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar transitioned from INIT to DOWNLOADING 2015-04-29 00:41:10,358 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar transitioned from INIT to
[jira] [Updated] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation
[ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3655: --- Summary: FairScheduler: potential livelock due to maxAMShare limitation and container reservation (was: FairScheduler: potential deadlock due to maxAMShare limitation and container reservation ) FairScheduler: potential livelock due to maxAMShare limitation and container reservation - Key: YARN-3655 URL: https://issues.apache.org/jira/browse/YARN-3655 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu FairScheduler: potential deadlock due to maxAMShare limitation and container reservation. If a node is reserved by an application, all the other applications don't have any chance to assign a new container on this node, unless the application which reserves the node assigns a new container on this node or releases the reserved container on this node. The problem is if an application tries to call assignReservedContainer and fail to get a new container due to maxAMShare limitation, it will block all other applications to use the nodes it reserves. If all other running applications can't release their AM containers due to being blocked by these reserved containers. A dead lock situation can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential dead lock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() !getUnmanagedAM()) { ListResourceRequest ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug(Skipping allocation because maxAMShare limit would + be exceeded); } return Resources.none(); } } {code} To fix this issue, we can unreserve the node if we can't allocate the AM container on the node due to Max AM share limitation and the node is reserved by the application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546391#comment-14546391 ] Vinod Kumar Vavilapalli commented on YARN-1902: --- This was discussed multiple times before. Two kinds of races can happen; a resource-table deduction happens when
# allocated containers are already sitting in the RM (tracked at YARN-110)
# allocated containers are already sitting in the client library
Seems like this JIRA is talking about both (1) and (2). The dist-shell example above sounds like it could be because of (1). Re (2), as Bikas says, the notion of forcing apps to deduct requests after a successful allocation (using AMRMClient.removeContainerRequest()) was introduced because the library clearly doesn't have an idea of which ResourceRequest to deduct from. [~leftnoteasy] mentioned offline that we could at-least deduct the count against the over-all number (ANY request) for a given priority. /cc [~bikassaha] Allocation of too many containers when a second request is done with the same resource capability - Key: YARN-1902 URL: https://issues.apache.org/jira/browse/YARN-1902 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Sietse T. Au Assignee: Sietse T. Au Labels: client Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers are requested in both scenarios, but that the correct behavior is observed only in the second scenario. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of the Map<Resource, ResourceRequestInfo> structure is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546396#comment-14546396 ] Hadoop QA commented on YARN-3632: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 25s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:red}-1{color} | javac | 7m 28s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 45s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 17s | The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 50m 6s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 86m 10s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | | | Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS; locked 66% of time Unsynchronized access at FileSystemRMStateStore.java:66% of time Unsynchronized access at FileSystemRMStateStore.java:[line 156] | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733262/YARN-3632.5.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / f7e051c | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/7955/artifact/patchprocess/diffJavacWarnings.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/7955/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7955/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7955/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7955/console | This message was automatically generated. Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. 
Some ordering policies may also need to reorder when demand changes, if that is part of the ordering comparison; this needs to be made available (and used by the FairOrderingPolicy when sizeBasedWeight is true). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
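One straightforward way to support reordering on demand change is to remove and re-insert the application in the policy's sorted structure around the change. The sketch below illustrates that pattern with a hypothetical demandUpdated hook; it is not the actual OrderingPolicy interface or the attached patch.
{code}
import java.util.Comparator;
import java.util.TreeSet;

/** Sketch of a comparator-backed ordering policy that re-sorts on demand change. */
public class DemandAwareOrderingSketch<T> {

  private final TreeSet<T> entities;

  public DemandAwareOrderingSketch(Comparator<T> comparator) {
    this.entities = new TreeSet<T>(comparator);
  }

  public void addSchedulableEntity(T app) {
    entities.add(app);
  }

  /**
   * Hypothetical demandUpdated hook: a TreeSet cannot see that a key it
   * already holds has changed, so the entity is removed before the demand
   * change is applied and re-inserted afterwards to keep the order correct.
   */
  public void demandUpdated(T app, Runnable applyDemandChange) {
    entities.remove(app);
    applyDemandChange.run();
    entities.add(app);
  }

  /** The entity that should be offered the next allocation. */
  public T peekNext() {
    return entities.first();
  }
}
{code}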
[jira] [Created] (YARN-3664) Federation PolicyStore APIs
Subru Krishnan created YARN-3664: Summary: Federation PolicyStore APIs Key: YARN-3664 URL: https://issues.apache.org/jira/browse/YARN-3664 Project: Hadoop YARN Issue Type: Sub-task Reporter: Subru Krishnan The federation Policy Store contains information about the capacity allocations made by users, their mapping to sub-clusters and the policies that each of the components (Router, AMRMPRoxy, RMs) should enforce -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3666) Federation Intercepting and propagating AM-RM communications
Kishore Chaliparambil created YARN-3666: --- Summary: Federation Intercepting and propagating AM-RM communications Key: YARN-3666 URL: https://issues.apache.org/jira/browse/YARN-3666 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Kishore Chaliparambil In order to support transparent spanning of jobs across sub-clusters, all AM-RM communications are proxied (via YARN-2884). This JIRA tracks federation-specific mechanisms that decide how to split/broadcast requests to the RMs and merge answers to the AM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546209#comment-14546209 ] MENG DING commented on YARN-1902: - I was almost going to log the same issue when I saw this thread (and also YARN-3020) :-). After reading all the discussions, and after reading the related code, I still believe this is a bug. I understand what [~bikassaha] has said, that the AM-RM protocol is NOT a delta protocol, and that currently the user (i.e., the ApplicationMaster) is responsible for calling removeContainerRequest() after receiving an allocation, but consider the following simple modification to the packaged *distributedshell* application:
{code:title=ApplicationMaster.java|borderStyle=solid}
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
@@ -805,6 +805,8 @@ public void onContainersAllocated(List<Container> allocatedContainers) {
       // as all containers may not be allocated at one go.
       launchThreads.add(launchThread);
       launchThread.start();
+      ContainerRequest containerAsk = setupContainerAskForRM();
+      amRMClient.removeContainerRequest(containerAsk);
     }
   }
{code}
The code simply removes a container request after successfully receiving an allocated container in the ApplicationMaster. When you submit this application by specifying, say, 3 containers on the CLI, you will sometimes get 4 containers allocated (not counting the AM container)!
{code}
root@node2:~# hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar /usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.0-SNAPSHOT.jar -shell_command sleep 10 -num_containers 3 -timeout 2
{code}
{code}
root@node2:~# yarn container -list appattempt_1431531743796_0015_01
15/05/15 20:49:01 INFO client.RMProxy: Connecting to ResourceManager at node2/10.211.55.102:8032
15/05/15 20:49:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Total number of containers :5
Container-Id Start Time Finish Time State Host Node Http Address LOG-URL
container_1431531743796_0015_01_05 Fri May 15 20:44:12 + 2015 N/A RUNNING node3:50093 http://node3:8042 http://node3:8042/node/containerlogs/container_1431531743796_0015_01_05/root
container_1431531743796_0015_01_01 Fri May 15 20:44:06 + 2015 N/A RUNNING node3:50093 http://node3:8042 http://node3:8042/node/containerlogs/container_1431531743796_0015_01_01/root
container_1431531743796_0015_01_02 Fri May 15 20:44:10 + 2015 N/A RUNNING node3:50093 http://node3:8042 http://node3:8042/node/containerlogs/container_1431531743796_0015_01_02/root
container_1431531743796_0015_01_04 Fri May 15 20:44:11 + 2015 N/A RUNNING node3:50093 http://node3:8042 http://node3:8042/node/containerlogs/container_1431531743796_0015_01_04/root
container_1431531743796_0015_01_03 Fri May 15 20:44:10 + 2015 N/A RUNNING node4:41128 http://node4:8042 http://node4:8042/node/containerlogs/container_1431531743796_0015_01_03/root
{code}
The *fundamental* problem here, I believe, is that the AMRMClient maintains an internal request table *remoteRequestsTable* that keeps track of *total* container requests (i.e., including container requests that have been satisfied, and that are not yet satisfied):
{code:title=AMRMClient.java|borderStyle=solid}
protected final Map<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>
    remoteRequestsTable =
        new TreeMap<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>();
{code}
However, the corresponding table *requests* at the scheduler side (inside AppSchedulingInfo.java) keeps track of *outstanding* container requests (i.e., container requests that are not yet satisfied):
{code:title=AppSchedulingInfo.java|borderStyle=solid}
final Map<Priority, Map<String, ResourceRequest>> requests =
    new ConcurrentHashMap<Priority, Map<String, ResourceRequest>>();
{code}
Every time an allocation is successfully made, the decResourceRequest() or decrementOutstanding() call will update the *requests* table so that it only
[jira] [Updated] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3626: -- Attachment: YARN-3626.9.patch In that case, here's a patch which goes back to the original approach used during troubleshooting, which uses the classpath itself to communicate the difference (it only touches other code to revert parts of the earlier patch no longer needed, the actual change, when done this way, is solely in ContainerLaunch.java, and it makes the conditional determination based on the classpath differences already present due to the manipulation earlier in the chain, in this case, by mapreduce due to user.classpath.first) On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546361#comment-14546361 ] Ishai Menache commented on YARN-3656: - LowCost judiciously “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. This leads to more balanced allocations, and in turn substantially improves the acceptance rate of jobs and the cluster utilization. LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Ishai Menache YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
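To make the "spreading" intuition concrete, here is a deliberately simplified toy, which is not the LowCost algorithm itself: each unit of demand is placed into the currently least-loaded slot of the allowed window, which is what an increasing, load-based cost function would also favor.
{code}
/**
 * Toy illustration of spreading a job's demand over its allowed time window
 * by always filling the currently least-loaded slot. This only conveys the
 * intuition behind a load-based cost function; it is not LowCost.
 */
public final class SpreadDemandSketch {

  public static int[] place(int demand, double[] clusterLoad, int start, int end) {
    int[] allocation = new int[clusterLoad.length];
    double[] load = clusterLoad.clone();
    for (int unit = 0; unit < demand; unit++) {
      int cheapest = start;
      for (int t = start; t <= end; t++) {
        if (load[t] < load[cheapest]) {
          cheapest = t;
        }
      }
      allocation[cheapest]++;
      load[cheapest]++; // placing a unit raises the load (and cost) of that slot
    }
    return allocation;
  }

  public static void main(String[] args) {
    double[] load = {5, 1, 3, 0, 4};    // existing committed load per slot
    int[] alloc = place(6, load, 1, 3); // spread 6 units inside slots [1, 3]
    System.out.println(java.util.Arrays.toString(alloc)); // prints [0, 3, 0, 3, 0]
  }
}
{code}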
[jira] [Commented] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546403#comment-14546403 ] Craig Welch commented on YARN-3632: --- findbugs and javac appear to be irrelevant... Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3632.0.patch, YARN-3632.1.patch, YARN-3632.3.patch, YARN-3632.4.patch, YARN-3632.5.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes if that is part of the ordering comparison, this needs to be made available (and used by the fairorderingpolicy when sizebasedweight is true) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546418#comment-14546418 ] Vinod Kumar Vavilapalli commented on YARN-1902: --- bq. Wangda Tan mentioned offline that we could at-least deduct the count against the over-all number (ANY request) for a given priority. Further thought tells me this is not desired in some cases as well. Take the following example. The user originally wants 1 container on H1, 1 container on H2, and 2 containers on R1 (rack). The request table becomes
|H1|1|
|H2|1|
|R1|2|
|*|4|
Now assuming the RM returns a container on R2 (rack), auto-decrementing the request table will make it
|H1|1|
|H2|1|
|R1|2|
|*|3|
But the user may actually want something like the following, depending on what the user's preferences are w.r.t. scheduling.
|H1|0|
|H2|1|
|R1|2|
|*|3|
Allocation of too many containers when a second request is done with the same resource capability - Key: YARN-1902 URL: https://issues.apache.org/jira/browse/YARN-1902 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Sietse T. Au Assignee: Sietse T. Au Labels: client Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers are requested in both scenarios, but that the correct behavior is observed only in the second scenario. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of the Map<Resource, ResourceRequestInfo> structure is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3657) Federation maintenance mechanisms (simple CLI and command propagation)
Carlo Curino created YARN-3657: -- Summary: Federation maintenance mechanisms (simple CLI and command propagation) Key: YARN-3657 URL: https://issues.apache.org/jira/browse/YARN-3657 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino The maintenance mechanisms provided by the RM are not sufficient in a federated environment. In this JIRA we track few extensions (more to come later) to allow basic maintenance mechanisms (and command propagation) for the federated components. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3658) Federation Capacity Allocation across sub-cluster
Carlo Curino created YARN-3658: -- Summary: Federation Capacity Allocation across sub-cluster Key: YARN-3658 URL: https://issues.apache.org/jira/browse/YARN-3658 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino This JIRA will track mechanisms to map federation level capacity allocations to sub-cluster level ones. (Possibly via reservation mechanisms). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3665) Federation subcluster membership mechanisms
[ https://issues.apache.org/jira/browse/YARN-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan reassigned YARN-3665: Assignee: Subru Krishnan Federation subcluster membership mechanisms --- Key: YARN-3665 URL: https://issues.apache.org/jira/browse/YARN-3665 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Subru Krishnan Assignee: Subru Krishnan The member YARN RMs continuously heartbeat to the state store to keep alive and publish their current capability/load information. This JIRA tracks these mechanisms. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546467#comment-14546467 ] Carlo Curino commented on YARN-3656: I worked closely with Ishai and Jonathan on this, and the integration with YARN-1051 is done rather carefully. After a month of running experiments, they confirmed consistently better performance on all the key metrics and reasonable runtimes. I would argue that, after a careful code review and some more testing, the LowCost agent he proposes should become our default agent for reservations, as it dominates the greedy agent we have today. LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Ishai Menache Attachments: LowCostRayonExternal.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3652) A SchedulerMetrics may be need for evaluating the scheduler's performance
[ https://issues.apache.org/jira/browse/YARN-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545607#comment-14545607 ] Xianyin Xin commented on YARN-3652: --- Thanks for the comments, [~sunilg]. {quote} 1. *Throughput* : Are you referring to #events processed over a period of time? If so, how can we set the timeline over which throughput is calculated (configurable?)? A clear benefit of this would be that we can predict a possible completion timeline for the pending events in the dispatcher queue. Combining throughput with the #no of pending events may give a much better indication of RM overload. {quote} In fact, the first thing that comes to my mind is the #containers allocated by the scheduler per second, because container allocation is what users care about and the node update event is the most important scheduler event. The rate of processing events is also a nice indicator, just as you commented. {quote} 2. However, many events come to the scheduler; if possible, a filter based on event type may help improve the accuracy of throughput and scheduling delay. {quote} +1 for the idea. Besides, the #events processed by the scheduler per second is large, so metrics based on it are volatile. We may consider some method to smooth the fluctuation, such as sampling or statistical aggregation. A SchedulerMetrics may be need for evaluating the scheduler's performance - Key: YARN-3652 URL: https://issues.apache.org/jira/browse/YARN-3652 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Reporter: Xianyin Xin As discussed in YARN-3630, a {{SchedulerMetrics}} may be needed for evaluating the scheduler's performance. The performance indicators include the #events waiting to be handled by the scheduler, the throughput, the scheduling delay, and/or other metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
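As a rough sketch of the smoothing idea, the following gauge tracks containers allocated per second and smooths the reading with an exponential moving average, so short bursts do not dominate it. The class and method names are invented for this example and are not a proposed API.
{code}
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative metric: containers allocated per second, smoothed with an
 * exponential moving average (EMA).
 */
public class AllocationRateMetric {
  private final AtomicLong allocatedSinceLastSample = new AtomicLong();
  private final double alpha;          // smoothing factor in (0, 1]; smaller = smoother
  private volatile double smoothedRate = 0.0;

  public AllocationRateMetric(double alpha) {
    this.alpha = alpha;
  }

  /** Called by the scheduler each time a container is allocated. */
  public void containerAllocated() {
    allocatedSinceLastSample.incrementAndGet();
  }

  /** Called by a sampler thread every intervalSeconds. */
  public void sample(double intervalSeconds) {
    double instantaneousRate =
        allocatedSinceLastSample.getAndSet(0) / intervalSeconds;
    smoothedRate = alpha * instantaneousRate + (1 - alpha) * smoothedRate;
  }

  public double getContainersPerSecond() {
    return smoothedRate;
  }
}
{code}
The same pattern would work for an event-processing rate; filtering by event type, as suggested above, would just mean keeping one such gauge per event type of interest.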
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545567#comment-14545567 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2126 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2126/]) YARN-3505. Node's Log Aggregation Report with SUCCEED should not cached in RMApps. Contributed by Xuan Gong. (junping_du: rev 15ccd967ee3e7046a50522089f67ba01f36ec76a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStatusEvent.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: 
log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all nodes' log aggregation reports in RMApps forever, especially those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545572#comment-14545572 ] Hudson commented on YARN-1519: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2126 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2126/]) YARN-1519. Check in container-executor if sysconf is implemented before using it (Radim Kolar and Eric Payne via raviprak) (raviprak: rev 53fe4eff09fdaeed75a8cad3a26156bf963a8d37) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a segfault because an invalid pointer gets passed to a libc function. Fix: enforce a minimum value of 1024; the same method is used in the hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545597#comment-14545597 ] Hudson commented on YARN-3505: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #186 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/186/]) YARN-3505. Node's Log Aggregation Report with SUCCEED should not cached in RMApps. Contributed by Xuan Gong. (junping_du: rev 15ccd967ee3e7046a50522089f67ba01f36ec76a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStatusEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatRequest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task 
Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Fix For: 2.8.0 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch, YARN-3505.6.patch, YARN-3505.addendum.patch Per discussions in YARN-1402, we shouldn't cache all nodes' log aggregation reports in RMApps forever, especially those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1519) check if sysconf is implemented before using it
[ https://issues.apache.org/jira/browse/YARN-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545602#comment-14545602 ] Hudson commented on YARN-1519: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #186 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/186/]) YARN-1519. Check in container-executor if sysconf is implemented before using it (Radim Kolar and Eric Payne via raviprak) (raviprak: rev 53fe4eff09fdaeed75a8cad3a26156bf963a8d37) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c check if sysconf is implemented before using it --- Key: YARN-1519 URL: https://issues.apache.org/jira/browse/YARN-1519 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.3.0 Reporter: Radim Kolar Assignee: Radim Kolar Labels: BB2015-05-TBR Fix For: 2.8.0 Attachments: YARN-1519.002.patch, YARN-1519.003.patch, nodemgr-sysconf.txt If the sysconf value _SC_GETPW_R_SIZE_MAX is not implemented, it leads to a segfault because an invalid pointer gets passed to a libc function. Fix: enforce a minimum value of 1024; the same method is used in the hadoop-common native code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2915) Enable YARN RM scale out via federation using multiple RM's
[ https://issues.apache.org/jira/browse/YARN-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2915: - Attachment: Yarn_federation_design_v1.pdf Uploading a design proposal based on offline design discussions within our team and with [~kasha], [~adhoot], [~vinodkv], [~acmurthy], [~tucu00], and more people (apologies if I missed anyone). We validated the proposed design by developing a prototype, and we have a basic end-to-end functioning system in which we can stitch multiple YARN clusters into a unified federated cluster and run jobs that transparently span all of them. Enable YARN RM scale out via federation using multiple RM's --- Key: YARN-2915 URL: https://issues.apache.org/jira/browse/YARN-2915 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao Assignee: Subru Krishnan Attachments: Yarn_federation_design_v1.pdf This is an umbrella JIRA that proposes to scale out YARN to support large clusters comprising tens of thousands of nodes. That is, rather than limiting a YARN-managed cluster to about 4k nodes in size, the proposal is to enable the YARN-managed cluster to be elastically scalable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation
[ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546508#comment-14546508 ] Hadoop QA commented on YARN-3655: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 53s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 19s | The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 60m 16s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 96m 47s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | | | Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS; locked 66% of time Unsynchronized access at FileSystemRMStateStore.java:66% of time Unsynchronized access at FileSystemRMStateStore.java:[line 156] | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12733282/YARN-3655.000.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8f37873 | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/7956/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7956/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7956/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7956/console | This message was automatically generated. FairScheduler: potential livelock due to maxAMShare limitation and container reservation - Key: YARN-3655 URL: https://issues.apache.org/jira/browse/YARN-3655 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3655.000.patch FairScheduler: potential livelock due to maxAMShare limitation and container reservation. 
If a node is reserved by an application, no other application has any chance to assign a new container on that node unless the application holding the reservation either assigns a new container there or releases the reserved container. The problem is that if an application calls assignReservedContainer and fails to get a new container due to the maxAMShare limitation, it blocks all other applications from using the nodes it reserves. If none of the other running applications can launch their AM containers because they are blocked by these reserved containers, a livelock can occur. The following code at FSAppAttempt#assignContainer can cause this potential livelock. {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() && !getUnmanagedAM()) { List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM(ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug("Skipping allocation because maxAMShare limit would" + " be exceeded"); } return Resources.none(); } } {code} To fix
[jira] [Updated] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3626: -- Attachment: YARN-3626.11.patch Now using the environment to pass the configuration. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.11.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch In response to the mapreduce.job.user.classpath.first setting, the classpath is ordered differently so that localized resources appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that, localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so they are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3565: Attachment: YARN-3565.20150516-1.patch Hi [~wangda], Uploading a patch that fixes the applicable [~vinodkv] comments; the ones not addressed are: * ??Not directly related to this patch, but in LabelsToNodeIdsProto??: As discussed offline, will be handled in YARN-3583 * ??ResourceTrackerService shouldn't have convertToStringSet(). RMNodeLabelsManager.replaceLabelsOnNode() etc. should be modified to use the NodeLabel object??: As per the above [~wangda] comment, not required now. * ??NodeReportProto also have a string for node_labels instead of an object.??: As discussed offline, will be handled in YARN-3583 NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String - Key: YARN-3565 URL: https://issues.apache.org/jira/browse/YARN-3565 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Priority: Blocker Attachments: YARN-3565-20150502-1.patch, YARN-3565.20150515-1.patch, YARN-3565.20150516-1.patch Now NM HB/Register uses Set<String>; it will be hard to add new fields if we want to support specifying NodeLabel attributes such as exclusivity/constraints, etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
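To illustrate why an object is easier to evolve than a plain Set<String> (purely a hypothetical sketch, not the actual NodeLabel record): new attributes such as exclusivity can ride along inside the existing field instead of forcing new parallel fields into the heartbeat/register protocol.
{code}
import java.util.Arrays;
import java.util.List;

// With Set<String> a label carries only a name; adding exclusivity or
// constraints later would require new, parallel protocol fields.
// With an object, new attributes are added to the existing field.
public class NodeLabelSketch {
  private final String name;
  private final boolean exclusive;   // new attribute, no extra protocol field needed

  public NodeLabelSketch(String name, boolean exclusive) {
    this.name = name;
    this.exclusive = exclusive;
  }

  public String getName() { return name; }
  public boolean isExclusive() { return exclusive; }

  public static void main(String[] args) {
    List<NodeLabelSketch> labels = Arrays.asList(
        new NodeLabelSketch("gpu", true),
        new NodeLabelSketch("ssd", false));
    labels.forEach(l ->
        System.out.println(l.getName() + " exclusive=" + l.isExclusive()));
  }
}
{code}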