[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598922#comment-14598922 ] Carlo Curino commented on YARN-3656: [~jyaniv] please address the checkstyle and whitespace -1 above. The rest is looking good. [~subru] can you comment on the test failure? Is this something that is going to be addressed by the work on making the reservation subsystem HA? > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, > YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
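For readers skimming the thread, a minimal sketch of the "spreading" idea described in the issue summary may help. This is not the patch code: the class and the loadCost/place helpers below are invented purely for illustration, and a real placement agent must also honor gang semantics, stage durations, and plan capacity constraints.

{code}
import java.util.HashMap;
import java.util.Map;

public class LoadBasedSpreaderSketch {

  /** Hypothetical cost: capacity already committed at time t, normalized by cluster capacity. */
  static double loadCost(Map<Integer, Integer> committed, int t, int clusterCapacity) {
    return committed.getOrDefault(t, 0) / (double) clusterCapacity;
  }

  /**
   * Spread 'demand' container-units over the window [start, end) by repeatedly
   * picking the time step with the lowest current load. Each placement updates
   * the committed load, so later units see the new (higher) cost.
   */
  static Map<Integer, Integer> place(Map<Integer, Integer> committed, int start, int end,
                                     int demand, int clusterCapacity) {
    Map<Integer, Integer> allocation = new HashMap<>();
    for (int i = 0; i < demand; i++) {
      int bestT = start;
      double bestCost = Double.MAX_VALUE;
      for (int t = start; t < end; t++) {
        double c = loadCost(committed, t, clusterCapacity);
        if (c < bestCost) {
          bestCost = c;
          bestT = t;
        }
      }
      allocation.merge(bestT, 1, Integer::sum);
      committed.merge(bestT, 1, Integer::sum);
    }
    return allocation;
  }
}
{code}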
[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation
[ https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598860#comment-14598860 ] Hadoop QA commented on YARN-3800: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 17s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 48s | The applied patch generated 1 new checkstyle issues (total was 54, now 49). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 56s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 89m 23s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741412/YARN-3800.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 49dfad9 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8330/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8330/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8330/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8330/console | This message was automatically generated. > Simplify inmemory state for ReservationAllocation > - > > Key: YARN-3800 > URL: https://issues.apache.org/jira/browse/YARN-3800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3800.001.patch, YARN-3800.002.patch, > YARN-3800.002.patch, YARN-3800.003.patch, YARN-3800.004.patch > > > Instead of storing the ReservationRequest we store the Resource for > allocations, as thats the only thing we need. Ultimately we convert > everything to resources anyway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598786#comment-14598786 ] Hadoop QA commented on YARN-3656: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 8m 1s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 5s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 50s | The applied patch generated 2 new checkstyle issues (total was 12, now 12). | | {color:red}-1{color} | whitespace | 0m 3s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 51m 55s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 92m 4s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741406/YARN-3656-v1.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 49dfad9 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8329/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8329/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8329/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8329/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8329/console | This message was automatically generated. > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, > YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. 
The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598781#comment-14598781 ] Ted Yu commented on YARN-3815: -- [~jrottinghuis]: Your description makes sense. Cell tag is supported since hbase 0.98+ so we can use it to mark completion. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is the query for stats can happen on: > - Application level, expect return: an application with aggregated stats > - Flow level, expect return: aggregated stats for a flow_run, flow_version > and flow > - User level, expect return: aggregated stats for applications submitted by > user > - Queue level, expect return: aggregated stats for applications within the > Queue > Application states is the basic building block for all other level > aggregations. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed which is missing from previous design documents like HBase/Phoenix > schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598759#comment-14598759 ] Joep Rottinghuis commented on YARN-3815: Thanks [~ted_yu] for that link. I did find that code and I'm reading through it. Yes, it uses a coprocessor on the reading side to "collapse" values together and permanently "collapse" them together on compaction. I want to use a similar approach here. We cannot use the delta write directly as-is for the following reasons: - For running applications, if we wanted to write only the increment, the AM (or ATS writer) would have to keep track of the previous values in order to write the increment only. When the AM crashes and/or the ATS writer restarts, we won't know what previous value we had written (and what has already been aggregated). So, we'd have to write the increment plus the latest value. - Ergo, why don't we just write the latest value to begin with and leave off the increment? Now we cannot "collapse" the deltas / latest value until the application is done. Otherwise we would again lose track of what was previously aggregated. So the new approach would be to write the latest value for an app and indicate (using a cell tag) that the app is done and that it can be collapsed. We would use the co-processor only on the read-side just like with the delta write, and that co-processor would aggregate values on the fly for reads and collapse during writes. Those writes would be limited to a single row, so we wouldn't have any weird cross-region locking issues, nor delays and hiccups in the write throughput. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is the query for stats can happen on: > - Application level, expect return: an application with aggregated stats > - Flow level, expect return: aggregated stats for a flow_run, flow_version > and flow > - User level, expect return: aggregated stats for applications submitted by > user > - Queue level, expect return: aggregated stats for applications within the > Queue > Application states is the basic building block for all other level > aggregations. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed which is missing from previous design documents like HBase/Phoenix > schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
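As a rough illustration of the read-side aggregation and compaction-time collapse described above, consider the sketch below. It is plain Java rather than real HBase coprocessor code: the AppCell type, the readAggregate/collapse helpers, and the boolean standing in for the "done" cell tag are all invented names.

{code}
import java.util.Iterator;
import java.util.Map;

/** Illustrative only: one "latest value" cell per app for a single metric in a flow row. */
class AppCell {
  long value;      // latest value reported by the app
  boolean done;    // in the real design this would be carried as an HBase cell tag

  AppCell(long value, boolean done) {
    this.value = value;
    this.done = done;
  }
}

class FlowMetricSketch {

  /** Read-side view: already-collapsed running total plus all per-app latest values. */
  static long readAggregate(long collapsedTotal, Map<String, AppCell> liveCells) {
    long sum = collapsedTotal;
    for (AppCell c : liveCells.values()) {
      sum += c.value;
    }
    return sum;
  }

  /**
   * Compaction-time collapse: only cells tagged "done" may be folded into the
   * running total; cells of still-running apps must be kept as-is, otherwise
   * we would lose track of what was already aggregated.
   */
  static long collapse(long collapsedTotal, Map<String, AppCell> liveCells) {
    Iterator<Map.Entry<String, AppCell>> it = liveCells.entrySet().iterator();
    while (it.hasNext()) {
      AppCell c = it.next().getValue();
      if (c.done) {
        collapsedTotal += c.value;
        it.remove();
      }
    }
    return collapsedTotal;
  }
}
{code}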
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598710#comment-14598710 ] Xuan Gong commented on YARN-221: I think that we could have this configuration {code} yarn.container-log-aggregation-policy.class org.apache.hadoop.yarn.container-log-aggregation-policy.SampleRateContainerLogAggregationPolicy {code} which can be used as the default log-aggregation policy. If the users do not specify the policy class in the ASC, the default policy will be used. But maybe we do not need this one to specify the policy parameters: {code} yarn.container-log-aggregation-policy.class.SampleRateContainerLogAggregationPolicy SR:0.2 {code} Instead, we could set the default value for the policy. Also, in AppLogAggregator.java (from the NM), after we parse the policy from the ASC, we should call ContainerLogAggregationPolicy.parseParameters(ASC.logAggregationContext.getParameters()). Everything else looks fine to me. > NM should provide a way for AM to tell it not to aggregate logs. > > > Key: YARN-221 > URL: https://issues.apache.org/jira/browse/YARN-221 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, nodemanager >Reporter: Robert Joseph Evans >Assignee: Ming Ma > Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, > YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch > > > The NodeManager should provide a way for an AM to tell it that either the > logs should not be aggregated, that they should be aggregated with a high > priority, or that they should be aggregated but with a lower priority. The > AM should be able to do this in the ContainerLaunch context to provide a > default value, but should also be able to update the value when the container > is released. > This would allow for the NM to not aggregate logs in some cases, and avoid > connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3843) Fair Scheduler should not accept apps with space keys as queue name
[ https://issues.apache.org/jira/browse/YARN-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598674#comment-14598674 ] zhihai xu commented on YARN-3843: - [~dongwook], thanks for the confirmation! > Fair Scheduler should not accept apps with space keys as queue name > --- > > Key: YARN-3843 > URL: https://issues.apache.org/jira/browse/YARN-3843 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.0, 2.5.0 >Reporter: Dongwook Kwon >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3843.01.patch > > > As YARN-461, since empty string queue name is not valid, queue name with > space keys such as " " ," " should not be accepted either, also not as > prefix nor postfix. > e.g) "root.test.queuename ", or "root.test. queuename" > I have 2 specific cases kill RM with these space keys as part of queue name. > 1) Without placement policy (hadoop 2.4.0 and above), > When a job is submitted with " "(space key) as queue name > e.g) mapreduce.job.queuename=" " > 2) With placement policy (hadoop 2.5.0 and above) > Once a job is submitted without space key as queue name, and submit another > job with space key. > e.g) 1st time: mapreduce.job.queuename="root.test.user1" > 2nd time: mapreduce.job.queuename="root.test.user1 " > {code} > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.974 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler > testQueueNameWithSpace(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) > Time elapsed: 0.724 sec <<< ERROR! > org.apache.hadoop.metrics2.MetricsException: Metrics source > QueueMetrics,q0=root,q1=adhoc,q2=birvine already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:218) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:96) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.(FSQueue.java:56) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.(FSLeafQueue.java:66) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createQueue(QueueManager.java:169) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:120) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:88) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:660) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:569) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1127) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testQueueNameWithSpace(TestFairScheduler.java:627) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
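For illustration, the kind of validation FairScheduler could apply before creating or looking up a queue might look like the sketch below. The helper name and class are made up here and this is not necessarily what the attached patch does; it simply rejects the queue names described above (" ", "root.test.user1 ", "root.test. queuename").

{code}
public class QueueNameValidatorSketch {

  /** Returns false for empty names, names with empty components, or components containing whitespace. */
  public static boolean isValidQueueName(String queueName) {
    if (queueName == null || queueName.trim().isEmpty()) {
      return false;
    }
    for (String component : queueName.split("\\.", -1)) {
      if (component.isEmpty()) {
        return false;               // e.g. "root..queue" or a trailing dot
      }
      for (char c : component.toCharArray()) {
        if (Character.isWhitespace(c)) {
          return false;             // e.g. " ", "root.test.user1 ", "root.test. queuename"
        }
      }
    }
    return true;
  }
}
{code}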
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598663#comment-14598663 ] Ming Ma commented on YARN-221: -- Thanks [~xgong]. How about the following? * Allow applications to specify the policy parameter via LogAggregationContext along with the policy class. {noformat} public abstract class LogAggregationContext { public void setContainerLogPolicyClass(Class logPolicy); public Class getContainerLogPolicyClass(); public void setParameters(String parameters); public String getParameters(); } {noformat} * NM uses default cluster-wide settings via the following configurations. MR can override these configurations on a per-application basis. {noformat} yarn.container-log-aggregation-policy.class org.apache.hadoop.yarn.container-log-aggregation-policy.SampleRateContainerLogAggregationPolicy yarn.container-log-aggregation-policy.class.SampleRateContainerLogAggregationPolicy SR:0.2 {noformat} * To support per-application policy, modify MR YarnRunner. We can also modify YarnClientImpl to read these configurations and set ApplicationSubmissionContext accordingly. * The log aggregation policy object loaded in NM can be shared among different applications as long as they belong to the same policy class with the same parameters. > NM should provide a way for AM to tell it not to aggregate logs. > > > Key: YARN-221 > URL: https://issues.apache.org/jira/browse/YARN-221 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, nodemanager >Reporter: Robert Joseph Evans >Assignee: Ming Ma > Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, > YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch > > > The NodeManager should provide a way for an AM to tell it that either the > logs should not be aggregated, that they should be aggregated with a high > priority, or that they should be aggregated but with a lower priority. The > AM should be able to do this in the ContainerLaunch context to provide a > default value, but should also be able to update the value when the container > is released. > This would allow for the NM to not aggregate logs in some cases, and avoid > connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
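A rough sketch of what a sample-rate policy built on the proposed interface might look like follows. Only the "SR:0.2" parameter format and the idea of parsing parameters from the LogAggregationContext come from the proposal above; the class name, the parseParameters/shouldAggregate signatures, and the deterministic sampling scheme are assumptions made for this sketch.

{code}
/**
 * Illustrative only: aggregates logs for roughly sampleRate of all containers,
 * chosen deterministically from the container id so every NM (and a retried
 * decision) gives the same answer for the same container.
 */
public class SampleRateLogAggregationPolicySketch {

  private float sampleRate = 0.2f;   // assumed cluster-wide default

  /** Called with the parameter string carried in the LogAggregationContext, e.g. "SR:0.2". */
  public void parseParameters(String parameters) {
    if (parameters != null && parameters.startsWith("SR:")) {
      sampleRate = Float.parseFloat(parameters.substring("SR:".length()));
    }
  }

  /** Decide whether to aggregate logs for a (non-negative) container id. */
  public boolean shouldAggregate(long containerId) {
    return (containerId % 100L) < (long) (sampleRate * 100);
  }
}
{code}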
[jira] [Updated] (YARN-3800) Simplify inmemory state for ReservationAllocation
[ https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3800: Attachment: YARN-3800.004.patch > Simplify inmemory state for ReservationAllocation > - > > Key: YARN-3800 > URL: https://issues.apache.org/jira/browse/YARN-3800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3800.001.patch, YARN-3800.002.patch, > YARN-3800.002.patch, YARN-3800.003.patch, YARN-3800.004.patch > > > Instead of storing the ReservationRequest we store the Resource for > allocations, as thats the only thing we need. Ultimately we convert > everything to resources anyway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation
[ https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598647#comment-14598647 ] Anubhav Dhoot commented on YARN-3800: - Addressed feedback > Simplify inmemory state for ReservationAllocation > - > > Key: YARN-3800 > URL: https://issues.apache.org/jira/browse/YARN-3800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3800.001.patch, YARN-3800.002.patch, > YARN-3800.002.patch, YARN-3800.003.patch, YARN-3800.004.patch > > > Instead of storing the ReservationRequest we store the Resource for > allocations, as thats the only thing we need. Ultimately we convert > everything to resources anyway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598623#comment-14598623 ] Jonathan Yaniv commented on YARN-3656: -- Thanks Carlo. I attached a new version of the patch (v1.1), in which we also implement GreedyReservationAgent using our algorithmic framework. We verified that the behavior of the new version is identical to the original via simulations (= the implementations generated identical allocations) and unit tests (= the implementations behaved similarly on corner cases). We also ran test-patch locally on v1.1 of the patch and got +1. > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, > YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Yaniv updated YARN-3656: - Attachment: YARN-3656-v1.1.patch > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, > YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3800) Simplify inmemory state for ReservationAllocation
[ https://issues.apache.org/jira/browse/YARN-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598612#comment-14598612 ] Subru Krishnan commented on YARN-3800: -- Thanks [~adhoot] for the updated patch. Overall it looks good, a few minor nits: * Can we rename _ReservationUtil_ to _ReservationSystemUtil_ to avoid confusion. * In _TestInMemoryPlan_, can we use *allocations* instead of *allocs* to minimize the diff. * In _TestInMemoryReservationAllocation_, we can continue using the previous constructor for non-gang allocations as the flag is required only for gang. * There is a redundant format change in _TestInMemoryReservationAllocation_ : bq. -Assert.assertEquals(allocations, rAllocation.getAllocationRequests()); +Assert.assertEquals(allocations, +rAllocation.getAllocationRequests()); > Simplify inmemory state for ReservationAllocation > - > > Key: YARN-3800 > URL: https://issues.apache.org/jira/browse/YARN-3800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3800.001.patch, YARN-3800.002.patch, > YARN-3800.002.patch, YARN-3800.003.patch > > > Instead of storing the ReservationRequest we store the Resource for > allocations, as thats the only thing we need. Ultimately we convert > everything to resources anyway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598590#comment-14598590 ] Sangjin Lee commented on YARN-3815: --- Moving from offline discussions... Now aggregation of *time series metrics* is rather tricky, and needs to be defined. Would an aggregated metric (e.g. at the flow level) of time series metrics (e.g. at the app level) be a time series itself? I see several problems with defining that as a time series. Individual app time series may be sampled at different times, and it's not clear what time series the aggregated flow metric would be. I think it might be simpler to say that an aggregated flow metric of time series may not need to be a time series itself. More generally, there is the issue of what time the aggregated values belong to, regardless of whether they are time series or not. If all leaf values were recorded at the same time, it would be unambiguous: the aggregated metric value is of that same time. However, that is rarely the case. I think the current implicit behavior in Hadoop is simply to take the latest values and add them up. One example is the MR counters (task level and job level). The task level counters are obtained at different times. Still, the corresponding job counters are simply sums of all the latest task counters, although they may have been taken at different times. We're basically taking that as an approximation that's "good enough". In the end, the final numbers will become accurate. In other words, the final values would truly be the accurate aggregate values. Time series add another wrinkle to this. In the case of a simple value, the final values are going to be correct, so this problem is less of an issue, but time series will retain intermediate values. Furthermore, their publishing interval may have no relationship with the publishing interval of the leaf values. I think the baseline approach should be either (1) do not use time series for the aggregated metrics, or (2) just do a best-effort approximation by adding up the latest leaf values and storing the sum with its own timestamp. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is the query for stats can happen on: > - Application level, expect return: an application with aggregated stats > - Flow level, expect return: aggregated stats for a flow_run, flow_version > and flow > - User level, expect return: aggregated stats for applications submitted by > user > - Queue level, expect return: aggregated stats for applications within the > Queue > Application states is the basic building block for all other level > aggregations. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed which is missing from previous design documents like HBase/Phoenix > schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
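A small sketch of approach (2) above, i.e. a best-effort aggregate computed from the latest leaf values and stamped with its own aggregation time. All names here are invented for illustration.

{code}
import java.util.Collection;

/** Illustrative only: one (value, timestamp) data point from a leaf entity. */
class MetricPoint {
  final long value;
  final long timestamp;

  MetricPoint(long value, long timestamp) {
    this.value = value;
    this.timestamp = timestamp;
  }
}

class BestEffortAggregationSketch {

  /**
   * Approach (2): add up the latest value of each leaf metric, even though the
   * leaves were sampled at different times, and record the sum with the time
   * at which the aggregation itself was performed.
   */
  static MetricPoint aggregateLatest(Collection<MetricPoint> latestLeafValues, long aggregationTime) {
    long sum = 0;
    for (MetricPoint p : latestLeafValues) {
      sum += p.value;
    }
    return new MetricPoint(sum, aggregationTime);
  }
}
{code}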
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598577#comment-14598577 ] Sangjin Lee commented on YARN-3815: --- {quote} About flow online aggregation, I am not quite sure on requirement yet. Do we really want real time for flow aggregated data or some fine-grained time interval (like 15 secs) should be good enough - if we want to show some nice metrics chart for flow, this should be fine. {quote} Yes, I agree with that. When I said "real time", it doesn't mean real time in the sense that every metric is accurate to the second. Most likely raw data themselves (e.g. container data) are written on an interval anyway. Some type of time interval for aggregation is implied. {quote} Any special reason not to handle it in the same way above - as HBase coprocessor? It just sound like gross-grained time interval. Isn't it? {quote} I do see your point in that what I called the "real time" aggregation can be considered the same type of aggregation as the "offline" aggregation only on a shorter time interval. However, we also need to think about the use cases of such aggregated data. The former type of aggregation is very much something that can be plugged into UI such as the RM UI or ambari to show more immediate data. These data may change as the user refreshes the UI. So this is closer to the raw data. On the other hand, the latter type of aggregation lends itself to more analytical and ad-hoc analysis of data. These can be used for calculating chargebacks, usage trending, reporting, etc. Perhaps it could even contain more detailed info than the "real time" aggregated data for the reporting/data mining purposes. And that's where we would like to consider using phoenix to enable arbitrary ad-hoc SQL queries. One analogy [~jrottinghuis] brings up regarding this is OLTP v. OLAP. That's why we also think it makes sense to do only "offline" (time-based) aggregation for users and queues. At least in our case with hRaven, there hasn't been a compelling reason to show user- or queue-aggregated data in semi-real time. It has been perfectly adequate to show time-based aggregation, as data like this tend to be used more for reporting and analysis. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is the query for stats can happen on: > - Application level, expect return: an application with aggregated stats > - Flow level, expect return: aggregated stats for a flow_run, flow_version > and flow > - User level, expect return: aggregated stats for applications submitted by > user > - Queue level, expect return: aggregated stats for applications within the > Queue > Application states is the basic building block for all other level > aggregations. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed which is missing from previous design documents like HBase/Phoenix > schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598565#comment-14598565 ] Hadoop QA commented on YARN-3069: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 21m 25s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 3m 4s | Site still builds. | | {color:green}+1{color} | checkstyle | 2m 4s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 11s | Tests passed in hadoop-common. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | | | 75m 23s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741380/YARN-3069.013.patch | | Optional Tests | site javadoc javac unit findbugs checkstyle | | git revision | trunk / 122cad6 | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8328/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8328/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8328/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8328/console | This message was automatically generated. > Document missing properties in yarn-default.xml > --- > > Key: YARN-3069 > URL: https://issues.apache.org/jira/browse/YARN-3069 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: BB2015-05-TBR, supportability > Attachments: YARN-3069.001.patch, YARN-3069.002.patch, > YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, > YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, > YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch, > YARN-3069.012.patch, YARN-3069.013.patch > > > The following properties are currently not defined in yarn-default.xml. > These properties should either be > A) documented in yarn-default.xml OR > B) listed as an exception (with comments, e.g. for internal use) in the > TestYarnConfigurationFields unit test > Any comments for any of the properties below are welcome. 
> org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker > org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore > security.applicationhistory.protocol.acl > yarn.app.container.log.backups > yarn.app.container.log.dir > yarn.app.container.log.filesize > yarn.client.app-submission.poll-interval > yarn.client.application-client-protocol.poll-timeout-ms > yarn.is.minicluster > yarn.log.server.url > yarn.minicluster.control-resource-monitoring > yarn.minicluster.fixed.ports > yarn.minicluster.use-rpc > yarn.node-labels.fs-store.retry-policy-spec > yarn.node-labels.fs-store.root-dir > yarn.node-labels.manager-class > yarn.nodemanager.container-executor.os.sched.priority.adjustment > yarn.nodemanager.container-monitor.process-tree.class > yarn.nodemanager.disk-health-checker.enable > yarn.nodemanager.docker-container-executor.image-name > yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms > yarn.nodemanager.linux-container-executor.group > yarn.nodemanager.log.deletion-threads-count > yarn.nodemanager.user-home-dir > yarn.nodemanager.webapp.https.address > yarn.nodemanager.webapp.spnego-keytab-file > yarn.nodemanager.webapp.spnego-principal > yarn.nodemanager.windows-secure-container-executor.group
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598556#comment-14598556 ] Sangjin Lee commented on YARN-3815: --- {quote} AM currently leverage YARN's AppTimelineCollector to forward entities to backend storage, so making AM talk directly to backend storage is not considered to be safe. {quote} Just to be clear, I'm *not* proposing AMs writing directly to the backend storage. AMs continue to write through the app-level timeline collector. My proposal is that the AMs are responsible for setting the aggregated framework-specific metric values on the *YARN application entities*. Let's consider the example of MR. MR itself would have its own entities such as job, tasks, and task attempts. These are distinct entities from the YARN entities such as application, app attempts, and containers. We can either (1) have the MR AM set framework-specific metric values at the YARN container entities and have YARN aggregate them to applications, or (2) have the MR AM set the aggregated values on the applications for itself. I feel the latter approach is conceptually cleaner. The framework is ultimately responsible for its metrics (YARN doesn't even know what metrics there are). We could decide that YARN would look at the framework-specific metrics at the app level and aggregate them from the app level onward to flows, user, and queue. In addition, most frameworks already have an aggregated view of the metrics. It would be very straightforward to emit them at the app level. In summary, option (1) asks the framework to write metrics on its own entities (job, tasks, task attempts) plus YARN container entities. Option (2) asks the framework to write metrics on its own entities (job, tasks, task attempts) plus YARN app entities. IMO, the latter is a more reliable approach. We can discuss this further... > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is the query for stats can happen on: > - Application level, expect return: an application with aggregated stats > - Flow level, expect return: aggregated stats for a flow_run, flow_version > and flow > - User level, expect return: aggregated stats for applications submitted by > user > - Queue level, expect return: aggregated stats for applications within the > Queue > Application states is the basic building block for all other level > aggregations. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed which is missing from previous design documents like HBase/Phoenix > schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598529#comment-14598529 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~zxu] Do you have any scenarios the latest patch doesn't cover? > ZKRMStateStore shouldn't create new session without occurrance of > SESSIONEXPIED > --- > > Key: YARN-3798 > URL: https://issues.apache.org/jira/browse/YARN-3798 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Varun Saxena >Priority: Blocker > Attachments: RM.log, YARN-3798-2.7.002.patch, > YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch > > > RM going down with NoNode exception during create of znode for appattempt > *Please find the exception logs* > {code} > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2015-06-09 10:09:44,886 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 10:09:44,887 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed > out ZK retries. Giving up! > 2015-06-09 10:09:44,887 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error > updating appAttempt: appattempt_1433764310492_7152_01 > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at
[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LINTE updated YARN-3840: Summary: Resource Manager web ui issue when sorting application by id (with application having id > 9999) (was: Resource Manager web ui issue when sorting application by id with id higher than 9999) > Resource Manager web ui issue when sorting application by id (with > application having id > 9999) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Centos 6.6 > Java 1.7 >Reporter: LINTE > Attachments: RMApps.png > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id with id higher than 9999
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LINTE updated YARN-3840: Summary: Resource Manager web ui issue when sorting application by id with id higher than 9999 (was: Resource Manager web ui bug on main view after application number 9999) > Resource Manager web ui issue when sorting application by id with id higher > than 9999 > -- > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Centos 6.6 > Java 1.7 >Reporter: LINTE > Attachments: RMApps.png > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3840) Resource Manager web ui bug on main view after application number 9999
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598419#comment-14598419 ] LINTE commented on YARN-3840: - Hi, xgong, yes, with YARN version 2.7.0. devarak.j, yes, I confirm this is an asc/desc sorting issue with application ids over 9999. Regards, > Resource Manager web ui bug on main view after application number 9999 > -- > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Centos 6.6 > Java 1.7 >Reporter: LINTE > Attachments: RMApps.png > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598384#comment-14598384 ] Jason Lowe commented on YARN-2902: -- Thanks for updating the patch, Varun! Is one second enough time for the localizer to tear down if the system is heavily loaded, disks are slow, etc.? I think it would be better for the executor to let us know when a localizer has completed rather than assuming 1 second will be enough time (or too much time). We can tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors are tracking localizers today. There are a number of sleeps in the unit test which we should try to avoid if possible. Is there a reason dispatcher.await() isn't sufficient to avoid the races? At a minimum there should be a comment for each one explaining what we're trying to avoid by sleeping. Nit: I've always interpreted the debug delay to be a delay to execute in debugging just before the NM deletes a file. To be consistent it seems that we should be adding the debug delay to any requested delay. That way the NM will always preserve a file for debugDelay seconds _beyond_ what an NM with debugDelay=0 seconds would do. Nit: The TODO in DeletionService about parent being owned by NM, etc. probably only needs to be in the delete method that actually does the work rather than duplicated in veneer methods. Nit: Should "Container killed while downloading" be "Container killed while localizing"? We use localizing elsewhere (e.g.: NM log UI when trying to get logs of a container that is still localizing). Nit: "Inorrect path for PRIVATE localization." should be "Incorrect path for PRIVATE localization: " to fix typo and add trailing space for subsequent filename. Missing a trailing space on the next log message as well. Realize this was just a pre-existing bug, but it would be nice to fix as part of moving the code. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
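To make the debug-delay nit above concrete, here is a tiny sketch of the suggested semantics: an NM configured with debugDelay=N keeps files for N seconds beyond whatever deletion delay was requested. The class and method names are invented here and this is not the DeletionService code.

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative only: always add the debug delay on top of the requested delay. */
public class DebugDelaySketch {

  private final ScheduledExecutorService sched = Executors.newScheduledThreadPool(1);
  private final long debugDelaySec;

  public DebugDelaySketch(long debugDelaySec) {
    this.debugDelaySec = debugDelaySec;
  }

  /** A file requested for deletion after requestedDelaySec is kept requestedDelaySec + debugDelaySec. */
  public void scheduleDeletion(Runnable deletionTask, long requestedDelaySec) {
    sched.schedule(deletionTask, requestedDelaySec + debugDelaySec, TimeUnit.SECONDS);
  }
}
{code}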
[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3069: - Attachment: YARN-3069.013.patch - Fix whitespace - Update against trunk > Document missing properties in yarn-default.xml > --- > > Key: YARN-3069 > URL: https://issues.apache.org/jira/browse/YARN-3069 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: BB2015-05-TBR, supportability > Attachments: YARN-3069.001.patch, YARN-3069.002.patch, > YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, > YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, > YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch, > YARN-3069.012.patch, YARN-3069.013.patch > > > The following properties are currently not defined in yarn-default.xml. > These properties should either be > A) documented in yarn-default.xml OR > B) listed as an exception (with comments, e.g. for internal use) in the > TestYarnConfigurationFields unit test > Any comments for any of the properties below are welcome. > org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker > org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore > security.applicationhistory.protocol.acl > yarn.app.container.log.backups > yarn.app.container.log.dir > yarn.app.container.log.filesize > yarn.client.app-submission.poll-interval > yarn.client.application-client-protocol.poll-timeout-ms > yarn.is.minicluster > yarn.log.server.url > yarn.minicluster.control-resource-monitoring > yarn.minicluster.fixed.ports > yarn.minicluster.use-rpc > yarn.node-labels.fs-store.retry-policy-spec > yarn.node-labels.fs-store.root-dir > yarn.node-labels.manager-class > yarn.nodemanager.container-executor.os.sched.priority.adjustment > yarn.nodemanager.container-monitor.process-tree.class > yarn.nodemanager.disk-health-checker.enable > yarn.nodemanager.docker-container-executor.image-name > yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms > yarn.nodemanager.linux-container-executor.group > yarn.nodemanager.log.deletion-threads-count > yarn.nodemanager.user-home-dir > yarn.nodemanager.webapp.https.address > yarn.nodemanager.webapp.spnego-keytab-file > yarn.nodemanager.webapp.spnego-principal > yarn.nodemanager.windows-secure-container-executor.group > yarn.resourcemanager.configuration.file-system-based-store > yarn.resourcemanager.delegation-token-renewer.thread-count > yarn.resourcemanager.delegation.key.update-interval > yarn.resourcemanager.delegation.token.max-lifetime > yarn.resourcemanager.delegation.token.renew-interval > yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size > yarn.resourcemanager.metrics.runtime.buckets > yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs > yarn.resourcemanager.reservation-system.class > yarn.resourcemanager.reservation-system.enable > yarn.resourcemanager.reservation-system.plan.follower > yarn.resourcemanager.reservation-system.planfollower.time-step > yarn.resourcemanager.rm.container-allocation.expiry-interval-ms > yarn.resourcemanager.webapp.spnego-keytab-file > yarn.resourcemanager.webapp.spnego-principal > yarn.scheduler.include-port-in-node-name > yarn.timeline-service.delegation.key.update-interval > yarn.timeline-service.delegation.token.max-lifetime > yarn.timeline-service.delegation.token.renew-interval > yarn.timeline-service.generic-application-history.enabled > > 
yarn.timeline-service.generic-application-history.fs-history-store.compression-type > yarn.timeline-service.generic-application-history.fs-history-store.uri > yarn.timeline-service.generic-application-history.store-class > yarn.timeline-service.http-cross-origin.enabled > yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
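To make option B above concrete, here is a rough, self-contained sketch of the kind of coverage check involved; the class name, the reflection walk, and the sample exception entry are assumptions for illustration and are not the actual TestYarnConfigurationFields code.
{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnDefaultXmlCoverageSketch {
  public static void main(String[] args) throws Exception {
    // Properties deliberately left out of yarn-default.xml (internal or
    // test-only settings) would be recorded here with a short comment.
    Set<String> exceptions = new HashSet<>();
    exceptions.add("yarn.is.minicluster"); // example entry, assumed internal-only

    // Load only the bundled yarn-default.xml.
    Configuration defaults = new Configuration(false);
    defaults.addResource("yarn-default.xml");

    // Walk the public String constants on YarnConfiguration and report any
    // yarn.* key that is neither documented nor listed as an exception.
    for (Field f : YarnConfiguration.class.getFields()) {
      if (f.getType() != String.class || !Modifier.isStatic(f.getModifiers())) {
        continue;
      }
      String key = (String) f.get(null);
      if (key != null && key.startsWith("yarn.")
          && defaults.get(key) == null && !exceptions.contains(key)) {
        System.out.println("Undocumented property: " + key);
      }
    }
  }
}
{code}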
[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3793: --- Assignee: Varun Saxena (was: Karthik Kambatla) > Several NPEs when deleting local files on NM recovery > - > > Key: YARN-3793 > URL: https://issues.apache.org/jira/browse/YARN-3793 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Varun Saxena > > When NM work-preserving restart is enabled, we see several NPEs on recovery. > These seem to correspond to sub-directories that need to be deleted. I wonder > if null pointers here mean incorrect tracking of these resources and a > potential leak. This JIRA is to investigate and fix anything required. > Logs show: > {noformat} > 2015-05-18 07:06:10,225 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : null > 2015-05-18 07:06:10,224 ERROR > org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during > execution of task in DeletionService > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) > at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598326#comment-14598326 ] Karthik Kambatla commented on YARN-3793: [~varun_saxena] - all yours. > Several NPEs when deleting local files on NM recovery > - > > Key: YARN-3793 > URL: https://issues.apache.org/jira/browse/YARN-3793 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Varun Saxena > > When NM work-preserving restart is enabled, we see several NPEs on recovery. > These seem to correspond to sub-directories that need to be deleted. I wonder > if null pointers here mean incorrect tracking of these resources and a > potential leak. This JIRA is to investigate and fix anything required. > Logs show: > {noformat} > 2015-05-18 07:06:10,225 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : null > 2015-05-18 07:06:10,224 ERROR > org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during > execution of task in DeletionService > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) > at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598313#comment-14598313 ] Sangjin Lee commented on YARN-3045: --- I took a quick pass at the latest patch. First, could you look at the checkstyle issue and the unit test failure? I think the unit test failure is an "existing" issue, but since you looked at it for YARN-3792, it'd be great if you could take another look. It looks like even the APPLICATION_CREATED_EVENT might be seeing the race condition? (NMTimelinePublisher.java) - I'm not 100% clear about the naming convention, but I was under the impression that we're sticking with the name "timelineservice" as the package name? Is that not the case? - l.223: minor nit, but let's make inner classes static unless they need to be non-static - l.252: I'm a bit puzzled by the hashCode override; is it necessary? If so, then we should also override equals. And also, why does it key only on the app id? - l.296: the same question here > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3045-YARN-2928.002.patch, > YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, > YARN-3045.20150420-1.patch > > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
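For the l.252 nit above, a minimal illustration of the contract being referenced (the class and field names below are assumptions for the sketch, not taken from NMTimelinePublisher): whenever hashCode is overridden, equals should be overridden on the same fields, and making the nested class static avoids an implicit reference to the enclosing publisher.
{code}
// Hypothetical static nested class keyed on the application id only.
static final class AppEventKey {
  private final ApplicationId appId;

  AppEventKey(ApplicationId appId) {
    this.appId = appId;
  }

  @Override
  public int hashCode() {
    return appId.hashCode();
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof AppEventKey)) {
      return false;
    }
    return appId.equals(((AppEventKey) other).appId);
  }
}
{code}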
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598301#comment-14598301 ] Varun Saxena commented on YARN-3793: Thanks for pointing this out. I looked for scenarios where a disk becomes bad and found one issue. > Several NPEs when deleting local files on NM recovery > - > > Key: YARN-3793 > URL: https://issues.apache.org/jira/browse/YARN-3793 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > When NM work-preserving restart is enabled, we see several NPEs on recovery. > These seem to correspond to sub-directories that need to be deleted. I wonder > if null pointers here mean incorrect tracking of these resources and a > potential leak. This JIRA is to investigate and fix anything required. > Logs show: > {noformat} > 2015-05-18 07:06:10,225 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : null > 2015-05-18 07:06:10,224 ERROR > org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during > execution of task in DeletionService > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) > at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598300#comment-14598300 ] Varun Saxena commented on YARN-3793: [~kasha], can I work on this JIRA ? > Several NPEs when deleting local files on NM recovery > - > > Key: YARN-3793 > URL: https://issues.apache.org/jira/browse/YARN-3793 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > When NM work-preserving restart is enabled, we see several NPEs on recovery. > These seem to correspond to sub-directories that need to be deleted. I wonder > if null pointers here mean incorrect tracking of these resources and a > potential leak. This JIRA is to investigate and fix anything required. > Logs show: > {noformat} > 2015-05-18 07:06:10,225 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : null > 2015-05-18 07:06:10,224 ERROR > org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during > execution of task in DeletionService > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) > at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598299#comment-14598299 ] Varun Saxena commented on YARN-3793: [~kasha], I think I know what's happening. When disks become bad (say, because they are full), there is a problem when uploading container logs. In {{AppLogAggregatorImpl#doContainerLogAggregation}} only good log directories are considered for log aggregation. This leads to {{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}} returning no log files to be uploaded. The caller of {{doContainerLogAggregation}} is {{AppLogAggregatorImpl#uploadLogsForContainers}}, which, as can be seen below, will call {{DeletionService#delete}}. If {{uploadedFilePathsInThisCycle}} is empty *(which it will be if disks are full)*, both the sub directory and the base directories passed to the deletion task end up being null. This explains the NPEs being thrown. When these deletion tasks are stored in the state store, they are stored with nulls as well, which explains why it happens on recovery too. {code} boolean uploadedLogsInThisCycle = false; for (ContainerId container : pendingContainerInThisCycle) { ContainerLogAggregator aggregator = null; if (containerLogAggregators.containsKey(container)) { aggregator = containerLogAggregators.get(container); } else { aggregator = new ContainerLogAggregator(container); containerLogAggregators.put(container, aggregator); } Set<Path> uploadedFilePathsInThisCycle = aggregator.doContainerLogAggregation(writer, appFinished); if (uploadedFilePathsInThisCycle.size() > 0) { uploadedLogsInThisCycle = true; } this.delService.delete(this.userUgi.getShortUserName(), null, uploadedFilePathsInThisCycle .toArray(new Path[uploadedFilePathsInThisCycle.size()])); .. } {code} Log aggregation should consider full disks as well; otherwise there will be nothing to aggregate when disks are full. In any case, log aggregation would lead to deletion of the local logs. I verified the occurrence of this issue via TestLogAggregationService#testLocalFileDeletionAfterUpload by making the good log directories return nothing. > Several NPEs when deleting local files on NM recovery > - > > Key: YARN-3793 > URL: https://issues.apache.org/jira/browse/YARN-3793 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > When NM work-preserving restart is enabled, we see several NPEs on recovery. > These seem to correspond to sub-directories that need to be deleted. I wonder > if null pointers here mean incorrect tracking of these resources and a > potential leak. This JIRA is to investigate and fix anything required. > Logs show: > {noformat} > 2015-05-18 07:06:10,225 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : null > 2015-05-18 07:06:10,224 ERROR > org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during > execution of task in DeletionService > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) > at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
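One possible guard, sketched in the same fragment style as the snippet above (this illustrates the idea only and is not the actual patch): skip scheduling a deletion task when nothing was uploaded in the cycle, so {{DeletionService}} never receives a task whose sub directory and base directories are both null.
{code}
// Sketch only: do not enqueue a deletion task when the upload set is empty.
Set<Path> uploadedFilePathsInThisCycle =
    aggregator.doContainerLogAggregation(writer, appFinished);
if (!uploadedFilePathsInThisCycle.isEmpty()) {
  uploadedLogsInThisCycle = true;
  this.delService.delete(this.userUgi.getShortUserName(), null,
      uploadedFilePathsInThisCycle
          .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
}
{code}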
[jira] [Commented] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598163#comment-14598163 ] Chris Douglas commented on YARN-3806: - [~wshao] Please don't delete obsoleted versions of the design doc, as it orphans discussion about them. Also, as you're making updates, please note the changes so people don't have to diff the docs. > Proposal of Generic Scheduling Framework for YARN > - > > Key: YARN-3806 > URL: https://issues.apache.org/jira/browse/YARN-3806 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wei Shao > Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.05.pdf, > ProposalOfGenericSchedulingFrameworkForYARN-V1.06.pdf > > > Currently, a typical YARN cluster runs many different kinds of applications: > production applications, ad hoc user applications, long running services and > so on. Different YARN scheduling policies may be suitable for different > applications. For example, capacity scheduling can manage production > applications well since application can get guaranteed resource share, fair > scheduling can manage ad hoc user applications well since it can enforce > fairness among users. However, current YARN scheduling framework doesn’t have > a mechanism for multiple scheduling policies work hierarchically in one > cluster. > YARN-3306 talked about many issues of today’s YARN scheduling framework, and > proposed a per-queue policy driven framework. In detail, it supported > different scheduling policies for leaf queues. However, support of different > scheduling policies for upper level queues is not seriously considered yet. > A generic scheduling framework is proposed here to address these limitations. > It supports different policies (fair, capacity, fifo and so on) for any queue > consistently. The proposal tries to solve many other issues in current YARN > scheduling framework as well. > Two new proposed scheduling policies YARN-3807 & YARN-3808 are based on > generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598054#comment-14598054 ] Sangjin Lee commented on YARN-3045: --- {quote} The lifecycle management of the app collector is a little tricky here: it gets registered when the first container (the AM) is launched, but it should not be unregistered immediately when the AM container stops. Waiting for the application finish event to reach the NM should work for most cases. For the corner case where the NM publisher takes too long (the queue is busy) to publish an event, it can still fail (a very low chance should be acceptable here). Later, we will run into a similar issue again when we do app-level aggregation in the app collector, since the aggregation process could still be running. In any case, we should pay special attention to lifecycle management for the collector - we have a separate JIRA to move it out of the auxiliary service. I think we can discuss this more together with/in that JIRA. {quote} It's a good point. I think some amount of "linger" after the AM container completes should be a fine solution. Note that for this to work, not only does the collector need to stay up, but the mapping must also not be removed from the RM. As [~djp] pointed out, having multiple app attempts (AMs) is another case. Perhaps the same linger can apply in that case, so that the collector can stick around to handle some writes until the collector belonging to the next AM comes online and registers itself. We need to hash out the details of the multiple-AM scenario, preferably in a different JIRA. > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3045-YARN-2928.002.patch, > YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, > YARN-3045.20150420-1.patch > > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
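A minimal sketch of the "linger" idea discussed above (the class, method names, and grace period are assumptions for illustration; the real collector lifecycle lives in the NM collector service): defer removal of the app-level collector for a grace period after the AM container finishes, so late events, or a new AM attempt, still find a collector.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CollectorLingerSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private static final long LINGER_MS = 60_000L; // assumed grace period

  public void onAmContainerFinished(String appId) {
    // Defer the actual removal; a new AM attempt registering a collector
    // in the meantime would simply replace this one.
    scheduler.schedule(() -> removeCollector(appId),
        LINGER_MS, TimeUnit.MILLISECONDS);
  }

  private void removeCollector(String appId) {
    System.out.println("Removing collector for " + appId);
  }
}
{code}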
[jira] [Commented] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it
[ https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598050#comment-14598050 ] zhihai xu commented on YARN-3831: - [~hex108], thanks for the confirmation! > Localization failed when a local disk turns from bad to good without NM > initializes it > -- > > Key: YARN-3831 > URL: https://issues.apache.org/jira/browse/YARN-3831 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Jun Gong >Assignee: Jun Gong > > A local disk turns from bad to good without NM initializes it(create > /path-to-local-dir/usercache and /path-to-local-dir/filecache). When > localizing a container, container-executor will try to create directories > under /path-to-local-dir/usercache, and it will fail. Then container's > localization will fail. > Related log is as following: > {noformat} > 2015-06-19 18:00:01,205 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1431957472783_38706012_01_000465 > 2015-06-19 18:00:01,212 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file > /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens. > Credentials list: > 2015-06-19 18:00:01,216 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_1431957472783_38706012_01_000465 startLocalizer is : > 20 > org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command > provided 0 > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is > tdwadmin > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create > directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Localizer failed > java.io.IOException: Application application_1431957472783_38706012 > initialization failed (exitCode=20) with output: main : command provided 0 > main : user is tdwadmin > Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such > file or directory > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) > Caused by: org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) > ... 1 more > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1431957472783_38706012_01_000465 transitioned from > LOCALIZING to LOCALIZATION_FAILED > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598044#comment-14598044 ] Masatake Iwasaki commented on YARN-3790: I'm +1(non-binding) too. Thanks for working on this. I saw the test failure 2 times on YARN-3705 and would like this to come in. > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, test >Reporter: Rohith Sharma K S >Assignee: zhihai xu > Attachments: YARN-3790.000.patch > > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598036#comment-14598036 ] Hadoop QA commented on YARN-2801: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 2m 55s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 19s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 3m 0s | Site still builds. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 6m 17s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741328/YARN-2801.3.patch | | Optional Tests | site | | git revision | trunk / 41ae776 | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8327/console | This message was automatically generated. > Documentation development for Node labels requirment > > > Key: YARN-2801 > URL: https://issues.apache.org/jira/browse/YARN-2801 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Gururaj Shetty >Assignee: Wangda Tan > Attachments: YARN-2801.1.patch, YARN-2801.2.patch, YARN-2801.3.patch > > > Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598027#comment-14598027 ] Sangjin Lee commented on YARN-2902: --- I'm OK with this JIRA proceeding as is. We'll need to isolate the public resource case more, and it won't be too late to file a separate issue if we do that later. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: YARN-2801.3.patch Thanks [~Naganarasimha] for additional review, attached ver.3 patch. > Documentation development for Node labels requirment > > > Key: YARN-2801 > URL: https://issues.apache.org/jira/browse/YARN-2801 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Gururaj Shetty >Assignee: Wangda Tan > Attachments: YARN-2801.1.patch, YARN-2801.2.patch, YARN-2801.3.patch > > > Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598012#comment-14598012 ] Junping Du commented on YARN-3045: -- Thanks [~Naganarasimha] for updating the patch! I'm looking into it now; detailed comments will follow. Some quick thoughts on your questions above. bq. I prefer to have all the container-related events and entities published by NMTimelinePublisher, so I wanted to push container usage metrics to NMTimelinePublisher as well. This will ensure all NM timeline stuff is put in one place and removes the thread pool handling in ContainerMonitorImpl. I am generally fine with consolidating the publishing of events and metrics in NMTimelinePublisher. However, we should check later whether a separate event queue is needed, to make sure a burst of container metrics won't affect events getting published. bq. When the AM container finishes and removes the collector for the app, there is still a possibility that events published for the app by the current NM and other NMs are still in the pipeline, so I was wondering whether we can have a timer task which periodically cleans up collectors after some period rather than removing them immediately when the AM container is finished. The lifecycle management of the app collector is a little tricky here: it gets registered when the first container (the AM) is launched, but it should not be unregistered immediately when the AM container stops. Waiting for the application finish event to reach the NM should work for most cases. For the corner case where the NM publisher takes too long (the queue is busy) to publish an event, it can still fail (a very low chance should be acceptable here). Later, we will run into a similar issue again when we do app-level aggregation in the app collector, since the aggregation process could still be running. In any case, we should pay special attention to lifecycle management for the collector - we have a separate JIRA to move it out of the auxiliary service. I think we can discuss this more together with/in that JIRA. > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3045-YARN-2928.002.patch, > YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, > YARN-3045.20150420-1.patch > > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597971#comment-14597971 ] Masatake Iwasaki commented on YARN-2871: Thanks for working on this, [~zxu]! These intermittent test failures have been annoying me lately. {code} Thread.sleep(1000); {code} Is it possible to use {{MockRM#waitForState}} to wait until the application state is recovered? Sleeping for a fixed time is not deterministic and it makes the test unnecessarily long, though there are many lines calling Thread#sleep in the test... > TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk > - > > Key: YARN-2871 > URL: https://issues.apache.org/jira/browse/YARN-2871 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-2871.000.patch > > > From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): > {code} > Failed tests: > TestRMRestart.testRMRestartGetApplicationList:957 > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
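A fragment-style sketch of the suggestion above (the surrounding variable names, the target state, and the exact {{MockRM#waitForState}} overload are assumptions here): replace the fixed sleep with an explicit wait on the recovered application's expected state.
{code}
// Sketch only: wait for the recovered app to reach its expected state
// instead of sleeping for a fixed second.
rm2.start();
// was: Thread.sleep(1000);
rm2.waitForState(app1.getApplicationId(), RMAppState.FINISHED);
{code}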
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597938#comment-14597938 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 47s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 9 new checkstyle issues (total was 168, now 138). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 6s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 32s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741309/YARN-2902.04.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 41ae776 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8326/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8326/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container
[ https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597920#comment-14597920 ] Lei Guo commented on YARN-1488: --- Based on the information from [Stinger.next track | http://hortonworks.com/blog/evolving-apache-hadoop-yarn-provide-resource-workload-management-services/], this JIRA should be the foundation of the YARN/LLAP integration; is there any plan/design for this JIRA? > Allow containers to delegate resources to another container > --- > > Key: YARN-1488 > URL: https://issues.apache.org/jira/browse/YARN-1488 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy >Assignee: Arun C Murthy > > We should allow containers to delegate resources to another container. This > would allow external frameworks to share not just YARN's resource-management > capabilities but also its workload-management capabilities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597873#comment-14597873 ] Jason Lowe commented on YARN-3832: -- Ah, I think that might be the clue as to what went wrong. If the NM recreated the state store on startup then ResourceLocalizationService will try to cleanup the localized resources to prevent them from getting out of sync with the state store. Unfortunately the code does this: {code} private void cleanUpLocalDirs(FileContext lfs, DeletionService del) { for (String localDir : dirsHandler.getLocalDirs()) { cleanUpLocalDir(lfs, del, localDir); } {code} It should be calling dirsHandler.getLocalDirsForCleanup, since getLocalDirs will not include any disks that are full. Since the disk was too full, it probably wasn't in the list of local dirs and therefore we avoided cleaning up the localized resources on the disk. Later when the disk became good it tried to use it, but at that point the state store and localized resources on that disk are out of sync and new localizations can collide with old ones. > Resource Localization fails on a cluster due to existing cache directories > -- > > Key: YARN-3832 > URL: https://issues.apache.org/jira/browse/YARN-3832 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Ranga Swamy >Assignee: Brahma Reddy Battula > > *We have found resource localization fails on a cluster with following > error.* > > Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) > {noformat} > Application application_1434703279149_0057 failed 2 times due to AM Container > for appattempt_1434703279149_0057_02 exited with exitCode: -1000 > For more detailed output, check application tracking > page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, > click on links to logs of each attempt. > Diagnostics: Rename cannot overwrite non empty destination directory > /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 > java.io.IOException: Rename cannot overwrite non empty destination directory > /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 > at > org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) > at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) > at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) > at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Failing this attempt. Failing the application. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
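A fragment-style sketch of the change being suggested above (it mirrors the quoted snippet and is not the committed patch): iterate over the cleanup-oriented list, which also includes full disks, so localized resources sitting on a full disk are removed when the state store has been recreated.
{code}
// Sketch only: include full (but otherwise usable) disks in the cleanup pass.
private void cleanUpLocalDirs(FileContext lfs, DeletionService del) {
  for (String localDir : dirsHandler.getLocalDirsForCleanup()) {
    cleanUpLocalDir(lfs, del, localDir);
  }
}
{code}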
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597863#comment-14597863 ] Varun Saxena commented on YARN-2902: Fixed the checkstyle issues. A lot of the changes in {{ResourceLocalizationService#findNextResource}} are due to indentation issues reported by checkstyle, so I had to re-indent code that I had not written. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2902: --- Attachment: YARN-2902.04.patch > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597825#comment-14597825 ] Hudson commented on YARN-3842: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2183 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2183/]) YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597823#comment-14597823 ] Hudson commented on YARN-3835: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2183 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2183/]) YARN-3835. hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 99271b762129d78c86f3c9733a24c77962b0b3f7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * hadoop-yarn-project/CHANGES.txt > hadoop-yarn-server-resourcemanager test package bundles core-site.xml, > yarn-site.xml > > > Key: YARN-3835 > URL: https://issues.apache.org/jira/browse/YARN-3835 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Vamsee Yarlagadda >Assignee: Vamsee Yarlagadda >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3835.patch > > > It looks like by default yarn is bundling core-site.xml, yarn-site.xml in > test artifact of hadoop-yarn-server-resourcemanager which means that any > downstream project which uses this a dependency can have a problem in picking > up the user supplied/environment supplied core-site.xml, yarn-site.xml > So we should ideally exclude these .xml files from being bundled into the > test-jar. (Similar to YARN-1748) > I also proactively looked at other YARN modules where this might be > happening. > {code} > vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml" > ./hadoop-yarn/conf/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml > {code} > And out of these only two modules (hadoop-yarn-server-resourcemanager, > hadoop-yarn-server-tests) are building test-jars. In future, if we start > building test-jar of other modules, we should exclude these xml files from > being bundled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597817#comment-14597817 ] Brahma Reddy Battula commented on YARN-3793: [~kasha], one possible scenario is: when the disk became bad and the NM stopped, I have seen this NPE (where the list of good dirs is empty): {noformat} 2015-06-19 03:09:10,528 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_1434452428753_0522_01_000162. Current good log dirs are 2015-06-19 03:09:10,528 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:761) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} > Several NPEs when deleting local files on NM recovery > - > > Key: YARN-3793 > URL: https://issues.apache.org/jira/browse/YARN-3793 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > When NM work-preserving restart is enabled, we see several NPEs on recovery. > These seem to correspond to sub-directories that need to be deleted. I wonder > if null pointers here mean incorrect tracking of these resources and a > potential leak. This JIRA is to investigate and fix anything required. > Logs show: > {noformat} > 2015-05-18 07:06:10,225 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : null > 2015-05-18 07:06:10,224 ERROR > org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during > execution of task in DeletionService > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) > at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597810#comment-14597810 ] Jason Lowe commented on YARN-3809: -- +1, the latest patch LGTM. Will commit this later today if there are no objections. > Failed to launch new attempts because ApplicationMasterLauncher's threads all > hang > -- > > Key: YARN-3809 > URL: https://issues.apache.org/jira/browse/YARN-3809 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3809.01.patch, YARN-3809.02.patch, > YARN-3809.03.patch > > > ApplicationMasterLauncher creates a thread pool of size 10 to deal with > AMLauncherEventType (LAUNCH and CLEANUP). > In our cluster, there were many NMs with 10+ AMs running on them, and one shut > down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs > running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP > events. ApplicationMasterLauncher's thread pool filled up, and all of its > threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM > was down; the default RPC timeout is 15 minutes. This means that for 15 minutes > ApplicationMasterLauncher could not handle new events such as LAUNCH, so new > attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
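A fragment-style sketch of one mitigation (the configuration key, default value, and field wiring are assumptions for illustration, not necessarily what the attached patch does): make the launcher pool size configurable so a burst of CLEANUP events for a lost NM cannot starve LAUNCH events for the length of the RPC timeout.
{code}
// Sketch only: size the AM launcher pool from configuration instead of
// hard-coding 10 threads.
int threadCount = conf.getInt(
    "yarn.resourcemanager.amlauncher.thread-count", 50); // assumed key/default
launcherPool = new ThreadPoolExecutor(threadCount, threadCount,
    1, TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
{code}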
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597805#comment-14597805 ] Brahma Reddy Battula commented on YARN-3832: [~jlowe] Sorry for the late reply. After looking into the logs, I found that the *disk was declared bad (since it reached the 90% usage threshold) and the node became unhealthy*: {noformat} 2015-06-19 04:39:18,498 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/hdfsdata/HA/nmlocal error, used space above threshold of 90.0%, removing from list of valid directories 2015-06-19 04:39:18,498 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/hdfsdata/HA/nmlog error, used space above threshold of 90.0%, removing from list of valid directories 2015-06-19 04:39:18,498 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /opt/hdfsdata/HA/nmlocal; 1/1 log-dirs are bad: /opt/hdfsdata/HA/nmlog 2015-06-19 04:39:18,499 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/hdfsdata/HA/nmlocal; 1/1 log-dirs are bad: /opt/hdfsdata/HA/nmlog {noformat} On restart of the NM, those disks turned good again: 2015-06-19 04:47:18,765 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) turned good: 1/1 local-dirs are good: /opt/hdfsdata/HA/nmlocal; 1/1 log-dirs are good: /opt/hdfsdata/HA/nmlog.. > Resource Localization fails on a cluster due to existing cache directories > -- > > Key: YARN-3832 > URL: https://issues.apache.org/jira/browse/YARN-3832 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Ranga Swamy >Assignee: Brahma Reddy Battula > > *We have found resource localization fails on a cluster with following > error.* > > Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) > {noformat} > Application application_1434703279149_0057 failed 2 times due to AM Container > for appattempt_1434703279149_0057_02 exited with exitCode: -1000 > For more detailed output, check application tracking > page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, > click on links to logs of each attempt. > Diagnostics: Rename cannot overwrite non empty destination directory > /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 > java.io.IOException: Rename cannot overwrite non empty destination directory > /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 > at > org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) > at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) > at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) > at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Failing this attempt. Failing the application. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3844) Make hadoop-yarn-project Native code -Wall-clean
Alan Burlison created YARN-3844: --- Summary: Make hadoop-yarn-project Native code -Wall-clean Key: YARN-3844 URL: https://issues.apache.org/jira/browse/YARN-3844 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.7.0 Environment: As we specify -Wall as a default compilation flag, it would be helpful if the Native code was -Wall-clean Reporter: Alan Burlison Assignee: Alan Burlison -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3844) Make hadoop-yarn-project Native code -Wall-clean
[ https://issues.apache.org/jira/browse/YARN-3844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3844: Description: As we specify -Wall as a default compilation flag, it would be helpful if the Native code was -Wall-clean > Make hadoop-yarn-project Native code -Wall-clean > > > Key: YARN-3844 > URL: https://issues.apache.org/jira/browse/YARN-3844 > Project: Hadoop YARN > Issue Type: Sub-task > Components: build >Affects Versions: 2.7.0 > Environment: As we specify -Wall as a default compilation flag, it > would be helpful if the Native code was -Wall-clean >Reporter: Alan Burlison >Assignee: Alan Burlison > > As we specify -Wall as a default compilation flag, it would be helpful if the > Native code was -Wall-clean -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597783#comment-14597783 ] Hudson commented on YARN-3842: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #235 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/235/]) YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/CHANGES.txt > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597781#comment-14597781 ] Hudson commented on YARN-3835: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #235 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/235/]) YARN-3835. hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 99271b762129d78c86f3c9733a24c77962b0b3f7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * hadoop-yarn-project/CHANGES.txt > hadoop-yarn-server-resourcemanager test package bundles core-site.xml, > yarn-site.xml > > > Key: YARN-3835 > URL: https://issues.apache.org/jira/browse/YARN-3835 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Vamsee Yarlagadda >Assignee: Vamsee Yarlagadda >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3835.patch > > > It looks like by default yarn is bundling core-site.xml, yarn-site.xml in > test artifact of hadoop-yarn-server-resourcemanager which means that any > downstream project which uses this a dependency can have a problem in picking > up the user supplied/environment supplied core-site.xml, yarn-site.xml > So we should ideally exclude these .xml files from being bundled into the > test-jar. (Similar to YARN-1748) > I also proactively looked at other YARN modules where this might be > happening. > {code} > vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml" > ./hadoop-yarn/conf/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml > {code} > And out of these only two modules (hadoop-yarn-server-resourcemanager, > hadoop-yarn-server-tests) are building test-jars. In future, if we start > building test-jar of other modules, we should exclude these xml files from > being bundled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597722#comment-14597722 ] Hudson commented on YARN-3842: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #226 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/226/]) YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597720#comment-14597720 ] Hudson commented on YARN-3835: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #226 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/226/]) YARN-3835. hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 99271b762129d78c86f3c9733a24c77962b0b3f7) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml > hadoop-yarn-server-resourcemanager test package bundles core-site.xml, > yarn-site.xml > > > Key: YARN-3835 > URL: https://issues.apache.org/jira/browse/YARN-3835 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Vamsee Yarlagadda >Assignee: Vamsee Yarlagadda >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3835.patch > > > It looks like by default yarn is bundling core-site.xml, yarn-site.xml in > test artifact of hadoop-yarn-server-resourcemanager which means that any > downstream project which uses this a dependency can have a problem in picking > up the user supplied/environment supplied core-site.xml, yarn-site.xml > So we should ideally exclude these .xml files from being bundled into the > test-jar. (Similar to YARN-1748) > I also proactively looked at other YARN modules where this might be > happening. > {code} > vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml" > ./hadoop-yarn/conf/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml > {code} > And out of these only two modules (hadoop-yarn-server-resourcemanager, > hadoop-yarn-server-tests) are building test-jars. In future, if we start > building test-jar of other modules, we should exclude these xml files from > being bundled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597706#comment-14597706 ] Hudson commented on YARN-3835: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/]) YARN-3835. hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 99271b762129d78c86f3c9733a24c77962b0b3f7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * hadoop-yarn-project/CHANGES.txt > hadoop-yarn-server-resourcemanager test package bundles core-site.xml, > yarn-site.xml > > > Key: YARN-3835 > URL: https://issues.apache.org/jira/browse/YARN-3835 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Vamsee Yarlagadda >Assignee: Vamsee Yarlagadda >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3835.patch > > > It looks like by default yarn is bundling core-site.xml, yarn-site.xml in > test artifact of hadoop-yarn-server-resourcemanager which means that any > downstream project which uses this a dependency can have a problem in picking > up the user supplied/environment supplied core-site.xml, yarn-site.xml > So we should ideally exclude these .xml files from being bundled into the > test-jar. (Similar to YARN-1748) > I also proactively looked at other YARN modules where this might be > happening. > {code} > vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml" > ./hadoop-yarn/conf/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml > {code} > And out of these only two modules (hadoop-yarn-server-resourcemanager, > hadoop-yarn-server-tests) are building test-jars. In future, if we start > building test-jar of other modules, we should exclude these xml files from > being bundled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597708#comment-14597708 ] Hudson commented on YARN-3842: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/]) YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/CHANGES.txt > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1965) Interrupted exception when closing YarnClient
[ https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla reassigned YARN-1965: - Assignee: Kuhu Shukla > Interrupted exception when closing YarnClient > - > > Key: YARN-1965 > URL: https://issues.apache.org/jira/browse/YARN-1965 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 2.3.0 >Reporter: Oleg Zhurakousky >Assignee: Kuhu Shukla >Priority: Minor > Labels: newbie > > It's more of a nuisance than a bug, but nevertheless > {code} > 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting > for clientExecutorto stop > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072) > at > java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) > at > org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191) > at org.apache.hadoop.ipc.Client.stop(Client.java:1235) > at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251) > at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112) > at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621) > at > org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206) > at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > . . . > {code} > It happens sporadically when stopping YarnClient. > Looking at the code in Client's 'unrefAndCleanup' it's not immediately obvious > why or by whom the interrupt is thrown, but in any event it should not be logged as > ERROR; probably a WARN with no stack trace. > Also, for consistency and correctness you may want to re-interrupt the current > thread as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
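A minimal sketch of the handling the reporter asks for: log the interruption at WARN without a stack trace and restore the thread's interrupt status. The executor, timeout, and Log parameters are assumptions for illustration, not the actual ipc.Client fields.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.commons.logging.Log;

public class ExecutorShutdownSketch {
  /** Waits for the executor to stop; logs interruption quietly and re-interrupts the thread. */
  static void shutdownQuietly(ExecutorService clientExecutor, long timeoutMs, Log log) {
    clientExecutor.shutdown();
    try {
      clientExecutor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      log.warn("Interrupted while waiting for clientExecutor to stop"); // WARN, no stack trace
      clientExecutor.shutdownNow();
      Thread.currentThread().interrupt(); // restore interrupt status for callers
    }
  }
}
{code}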
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597622#comment-14597622 ] Jun Gong commented on YARN-3809: As with the previous explanation, the checkstyle and test case errors are not related to this patch. > Failed to launch new attempts because ApplicationMasterLauncher's threads all > hang > -- > > Key: YARN-3809 > URL: https://issues.apache.org/jira/browse/YARN-3809 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3809.01.patch, YARN-3809.02.patch, > YARN-3809.03.patch > > > ApplicationMasterLauncher creates a thread pool of size 10 to deal with > AMLauncherEventType (LAUNCH and CLEANUP). > In our cluster, there were many NMs with 10+ AMs running on them, and one shut > down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running > on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. > Its thread pool filled up, and all of its threads hung > in containerMgrProxy.stopContainers(stopRequest) because the NM was > down and the default RPC timeout is 15 mins. This means that for 15 mins > ApplicationMasterLauncher could not handle new events such as LAUNCH, so new > attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
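To illustrate the failure mode in that description: a fixed pool of ten workers where every submitted CLEANUP blocks for the full RPC timeout leaves queued LAUNCH work waiting behind it. The class below is a stand-in sketch, not the real ApplicationMasterLauncher code.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LauncherPoolSketch {
  // Fixed pool of 10 workers, as in the description above.
  private final ExecutorService launcherPool = Executors.newFixedThreadPool(10);

  /** The Runnable stands in for the real LAUNCH/CLEANUP event handling. */
  public void onEvent(Runnable launchOrCleanupWork) {
    // For a CLEANUP against a dead NM, the work blocks inside
    // containerMgrProxy.stopContainers(...) until the RPC times out (15 min by
    // default), so ten such events tie up every worker and queued LAUNCH
    // events cannot make progress until a worker frees up.
    launcherPool.execute(launchOrCleanupWork);
  }
}
{code}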
[jira] [Resolved] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it
[ https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3831. Resolution: Not A Problem > Localization failed when a local disk turns from bad to good without NM > initializes it > -- > > Key: YARN-3831 > URL: https://issues.apache.org/jira/browse/YARN-3831 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Jun Gong >Assignee: Jun Gong > > A local disk turns from bad to good without NM initializes it(create > /path-to-local-dir/usercache and /path-to-local-dir/filecache). When > localizing a container, container-executor will try to create directories > under /path-to-local-dir/usercache, and it will fail. Then container's > localization will fail. > Related log is as following: > {noformat} > 2015-06-19 18:00:01,205 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1431957472783_38706012_01_000465 > 2015-06-19 18:00:01,212 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file > /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens. > Credentials list: > 2015-06-19 18:00:01,216 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_1431957472783_38706012_01_000465 startLocalizer is : > 20 > org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command > provided 0 > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is > tdwadmin > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create > directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Localizer failed > java.io.IOException: Application application_1431957472783_38706012 > initialization failed (exitCode=20) with output: main : command provided 0 > main : user is tdwadmin > Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such > file or directory > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) > Caused by: org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) > ... 
1 more > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1431957472783_38706012_01_000465 transitioned from > LOCALIZING to LOCALIZATION_FAILED > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3831) Localization failed when a local disk turns from bad to good without NM initializes it
[ https://issues.apache.org/jira/browse/YARN-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597616#comment-14597616 ] Jun Gong commented on YARN-3831: [~zxu], thank you for the reminder. Sorry for the late reply. The bug was found in version 2.2.0. I checked the latest code, and it seems to have been fixed: there is a 'localDirsChangeListener' that handles 'onDirsChanged'; when a local disk turns from bad to good, 'localDirsChangeListener' will try to initialize it. Closing this now. > Localization failed when a local disk turns from bad to good without NM > initializes it > -- > > Key: YARN-3831 > URL: https://issues.apache.org/jira/browse/YARN-3831 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Jun Gong >Assignee: Jun Gong > > A local disk turns from bad to good without NM initializes it(create > /path-to-local-dir/usercache and /path-to-local-dir/filecache). When > localizing a container, container-executor will try to create directories > under /path-to-local-dir/usercache, and it will fail. Then container's > localization will fail. > Related log is as following: > {noformat} > 2015-06-19 18:00:01,205 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1431957472783_38706012_01_000465 > 2015-06-19 18:00:01,212 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file > /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens. > Credentials list: > 2015-06-19 18:00:01,216 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_1431957472783_38706012_01_000465 startLocalizer is : > 20 > org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command > provided 0 > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is > tdwadmin > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create > directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Localizer failed > java.io.IOException: Application application_1431957472783_38706012 > initialization failed (exitCode=20) with output: main : command provided 0 > main : user is tdwadmin > Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such > file or directory > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981) > Caused by: org.apache.hadoop.util.Shell$ExitCodeException: > 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205) > ... 1 more > 2015-06-19 18:00:01,216 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1431957472783_38706012_01_000465 transitioned from > LOCALIZING to LOCALIZATION_FAILED > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
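A small sketch of the listener-based behaviour Jun Gong refers to in the comment above: when the set of good local dirs changes, re-create the usercache and filecache roots on any dir that turned good so the next localization can succeed. The interface and helper names here are illustrative, not the exact NodeManager API.
{code}
import java.io.File;
import java.util.List;

public class LocalDirsRecoverySketch {
  /** Stand-in for the NM callback fired when the set of good dirs changes. */
  interface DirsChangeListener {
    void onDirsChanged();
  }

  /** Re-creates the cache roots on every currently-good local dir. */
  static DirsChangeListener listenerFor(List<String> goodLocalDirs) {
    return () -> {
      for (String dir : goodLocalDirs) {
        new File(dir, "usercache").mkdirs(); // per-user cache root
        new File(dir, "filecache").mkdirs(); // public cache root
      }
    };
  }
}
{code}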
[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597532#comment-14597532 ] Hudson commented on YARN-3835: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/237/]) YARN-3835. hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 99271b762129d78c86f3c9733a24c77962b0b3f7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * hadoop-yarn-project/CHANGES.txt > hadoop-yarn-server-resourcemanager test package bundles core-site.xml, > yarn-site.xml > > > Key: YARN-3835 > URL: https://issues.apache.org/jira/browse/YARN-3835 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Vamsee Yarlagadda >Assignee: Vamsee Yarlagadda >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3835.patch > > > It looks like by default yarn is bundling core-site.xml, yarn-site.xml in > test artifact of hadoop-yarn-server-resourcemanager which means that any > downstream project which uses this a dependency can have a problem in picking > up the user supplied/environment supplied core-site.xml, yarn-site.xml > So we should ideally exclude these .xml files from being bundled into the > test-jar. (Similar to YARN-1748) > I also proactively looked at other YARN modules where this might be > happening. > {code} > vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml" > ./hadoop-yarn/conf/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml > {code} > And out of these only two modules (hadoop-yarn-server-resourcemanager, > hadoop-yarn-server-tests) are building test-jars. In future, if we start > building test-jar of other modules, we should exclude these xml files from > being bundled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597534#comment-14597534 ] Hudson commented on YARN-3842: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/237/]) YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/CHANGES.txt > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3840) Resource Manager web ui bug on main view after application number 9999
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597527#comment-14597527 ] Devaraj K commented on YARN-3840: - [~Alexandre LINTE], this seems to be a sorting issue with respect to the app ids. The sort considers only the first four digits of the application number when ordering ascending/descending, so application ids beyond 9999 are not shown where you would expect them and instead get mixed in with the other apps. You can see the attached image RMApps.png, which shows apps with ids above 9999 displayed among the other apps: !RMApps.png|thumbnail! Please check and confirm whether the same is happening in your case by searching for the specific app id in the search box. Thanks. > Resource Manager web ui bug on main view after application number 9999 > -- > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Centos 6.6 > Java 1.7 >Reporter: LINTE > Attachments: RMApps.png > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
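A tiny sketch of why ids beyond 9999 get scrambled when the id is compared textually (or on a fixed-width prefix), and how a numeric comparison of the sequence number restores the expected order. The snippet is illustrative only, not the web UI's actual sorting code.
{code}
import java.util.Comparator;

public class AppIdSortSketch {
  public static void main(String[] args) {
    String a = "application_1434703279149_9999";
    String b = "application_1434703279149_10000";

    // Textual comparison ranks "10000" before "9999", which scrambles the table.
    System.out.println(a.compareTo(b) < 0); // false

    // Comparing the numeric sequence number gives the expected order.
    Comparator<String> bySequence = Comparator.comparingLong(
        id -> Long.parseLong(id.substring(id.lastIndexOf('_') + 1)));
    System.out.println(bySequence.compare(a, b) < 0); // true
  }
}
{code}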
[jira] [Updated] (YARN-3840) Resource Manager web ui bug on main view after application number 9999
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3840: Attachment: RMApps.png > Resource Manager web ui bug on main view after application number 9999 > -- > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Centos 6.6 > Java 1.7 >Reporter: LINTE > Attachments: RMApps.png > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597512#comment-14597512 ] Hudson commented on YARN-3842: -- FAILURE: Integrated in Hadoop-Yarn-trunk #967 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/967/]) YARN-3842. NMProxy should retry on NMNotYetReadyException. (Robert Kanter via kasha) (kasha: rev 5ebf2817e58e1be8214dc1916a694a912075aa0a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3835) hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597510#comment-14597510 ] Hudson commented on YARN-3835: -- FAILURE: Integrated in Hadoop-Yarn-trunk #967 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/967/]) YARN-3835. hadoop-yarn-server-resourcemanager test package bundles core-site.xml, yarn-site.xml (vamsee via rkanter) (rkanter: rev 99271b762129d78c86f3c9733a24c77962b0b3f7) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml > hadoop-yarn-server-resourcemanager test package bundles core-site.xml, > yarn-site.xml > > > Key: YARN-3835 > URL: https://issues.apache.org/jira/browse/YARN-3835 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Vamsee Yarlagadda >Assignee: Vamsee Yarlagadda >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3835.patch > > > It looks like by default yarn is bundling core-site.xml, yarn-site.xml in > test artifact of hadoop-yarn-server-resourcemanager which means that any > downstream project which uses this a dependency can have a problem in picking > up the user supplied/environment supplied core-site.xml, yarn-site.xml > So we should ideally exclude these .xml files from being bundled into the > test-jar. (Similar to YARN-1748) > I also proactively looked at other YARN modules where this might be > happening. > {code} > vamsee-MBP:hadoop-yarn-project vamsee$ find . -name "*-site.xml" > ./hadoop-yarn/conf/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/resources/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/core-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes/yarn-site.xml > ./hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/core-site.xml > {code} > And out of these only two modules (hadoop-yarn-server-resourcemanager, > hadoop-yarn-server-tests) are building test-jars. In future, if we start > building test-jar of other modules, we should exclude these xml files from > being bundled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597488#comment-14597488 ] Hadoop QA commented on YARN-3045: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 48s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 58s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 49s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 32s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 59s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 9m 17s | Tests failed in hadoop-yarn-applications-distributedshell. | | {color:green}+1{color} | yarn tests | 6m 10s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 54m 27s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.applications.distributedshell.TestDistributedShell | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740912/YARN-3045-YARN-2928.004.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | YARN-2928 / 84f37f1 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8325/artifact/patchprocess/whitespace.txt | | hadoop-yarn-applications-distributedshell test log | https://builds.apache.org/job/PreCommit-YARN-Build/8325/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8325/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8325/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8325/console | This message was automatically generated. > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3045-YARN-2928.002.patch, > YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, > YARN-3045.20150420-1.patch > > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3045: Labels: (was: BB2015-05-TBR) > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3045-YARN-2928.002.patch, > YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, > YARN-3045.20150420-1.patch > > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597459#comment-14597459 ] Hadoop QA commented on YARN-2871: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 6m 41s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 45s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 40s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 69m 38s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741254/YARN-2871.000.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / 41ae776 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8324/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8324/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8324/console | This message was automatically generated. > TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk > - > > Key: YARN-2871 > URL: https://issues.apache.org/jira/browse/YARN-2871 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-2871.000.patch > > > From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): > {code} > Failed tests: > TestRMRestart.testRMRestartGetApplicationList:957 > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597392#comment-14597392 ] zhihai xu commented on YARN-2871: - I uploaded a patch YARN-2871.000.patch for review. > TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk > - > > Key: YARN-2871 > URL: https://issues.apache.org/jira/browse/YARN-2871 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-2871.000.patch > > > From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): > {code} > Failed tests: > TestRMRestart.testRMRestartGetApplicationList:957 > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2871: Attachment: YARN-2871.000.patch > TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk > - > > Key: YARN-2871 > URL: https://issues.apache.org/jira/browse/YARN-2871 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-2871.000.patch > > > From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): > {code} > Failed tests: > TestRMRestart.testRMRestartGetApplicationList:957 > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597368#comment-14597368 ] zhihai xu commented on YARN-2871: - I can work on this issue, Based on the failure logs https://builds.apache.org/job/PreCommit-YARN-Build/8323/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testRMRestartGetApplicationList_1_/, the root cause of this issue is a race condition in the test. {{logApplicationSummary}} is called when RMAppManager handles APP_COMPLETED RMAppManagerEvent. RMAppImpl sends APP_COMPLETED event to AsyncDispatcher thread. If AsyncDispatcher thread doesn't process APP_COMPLETED event on time, then the test will fail. I think If we add some delay before the verification, it will fix this issue. The important logs from failed test: {code} 2015-06-23 06:06:20,484 INFO [Thread-693] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(572)) - Recovery started 2015-06-23 06:06:20,484 INFO [Thread-693] security.RMDelegationTokenSecretManager (RMDelegationTokenSecretManager.java:recover(178)) - recovering RMDelegationTokenSecretManager. 2015-06-23 06:06:20,484 INFO [Thread-693] resourcemanager.RMAppManager (RMAppManager.java:recover(425)) - Recovering 3 applications 2015-06-23 06:06:20,485 DEBUG [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1435039562888_0001 of type RECOVER 2015-06-23 06:06:20,485 INFO [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:recover(781)) - Recovering app: application_1435039562888_0001 with 1 attempts and final state = FINISHED 2015-06-23 06:06:20,485 INFO [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(827)) - Recovering attempt: appattempt_1435039562888_0001_01 with final state: FINISHED 2015-06-23 06:06:20,485 DEBUG [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event for appattempt_1435039562888_0001_01 of type RECOVER 2015-06-23 06:06:20,486 INFO [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(793)) - appattempt_1435039562888_0001_01 State change from NEW to FINISHED 2015-06-23 06:06:20,486 INFO [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:handle(768)) - application_1435039562888_0001 State change from NEW to FINISHED 2015-06-23 06:06:20,486 DEBUG [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1435039562888_0002 of type RECOVER 2015-06-23 06:06:20,486 INFO [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:recover(781)) - Recovering app: application_1435039562888_0002 with 1 attempts and final state = FAILED 2015-06-23 06:06:20,487 INFO [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(827)) - Recovering attempt: appattempt_1435039562888_0002_01 with final state: FAILED 2015-06-23 06:06:20,487 DEBUG [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event for appattempt_1435039562888_0002_01 of type RECOVER 2015-06-23 06:06:20,487 INFO [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(793)) - appattempt_1435039562888_0002_01 State change from NEW to FAILED 2015-06-23 06:06:20,487 INFO [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:handle(768)) - application_1435039562888_0002 State change from NEW to FAILED 2015-06-23 06:06:20,488 DEBUG [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1435039562888_0003 of type RECOVER 2015-06-23 06:06:20,488 INFO [Thread-693] 
rmapp.RMAppImpl (RMAppImpl.java:recover(781)) - Recovering app: application_1435039562888_0003 with 1 attempts and final state = KILLED 2015-06-23 06:06:20,488 INFO [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(827)) - Recovering attempt: appattempt_1435039562888_0003_01 with final state: KILLED 2015-06-23 06:06:20,489 DEBUG [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event for appattempt_1435039562888_0003_01 of type RECOVER 2015-06-23 06:06:20,489 INFO [Thread-693] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(793)) - appattempt_1435039562888_0003_01 State change from NEW to KILLED 2015-06-23 06:06:20,489 INFO [Thread-693] rmapp.RMAppImpl (RMAppImpl.java:handle(768)) - application_1435039562888_0003 State change from NEW to KILLED 2015-06-23 06:06:20,489 INFO [Thread-693] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(579)) - Recovery ended 2015-06-23 06:06:20,489 DEBUG [Thread-693] service.CompositeService (CompositeService.java:serviceStart(115)) - RMActiveServices: starting services, size=15 2015-06-23 06:06:20,489 INFO [Thread-693] security.RMContainerTokenSecretManager (RMContainerTokenSecretManager.java:rollMasterKey(105)) - Rolling master-key for container-tokens 2015-06-23 06:06:20,4
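A hedged sketch of the fix direction described above: poll until the AsyncDispatcher has delivered all APP_COMPLETED events before asserting the Mockito interaction count, instead of racing with the dispatcher thread. It assumes the usual Mockito 1.x static imports (verify, times, isA) and that the helper sits where the test can see logApplicationSummary, as TestRMRestart does; it is not the actual patch.
{code}
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.resourcemanager.RMAppManager;

public class WaitForSummariesSketch {
  /** Polls until the dispatcher has delivered the expected APP_COMPLETED events. */
  static void waitForApplicationSummaries(RMAppManager rMAppManager, int expected)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + 10_000;
    while (true) {
      try {
        verify(rMAppManager, times(expected))
            .logApplicationSummary(isA(ApplicationId.class));
        return; // all events processed, verification passes
      } catch (AssertionError notYetDelivered) {
        if (System.currentTimeMillis() > deadline) {
          throw notYetDelivered; // give up and surface the original failure
        }
        Thread.sleep(100); // let the AsyncDispatcher thread catch up
      }
    }
  }
}
{code}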
[jira] [Assigned] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-2871: --- Assignee: zhihai xu > TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk > - > > Key: YARN-2871 > URL: https://issues.apache.org/jira/browse/YARN-2871 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: zhihai xu >Priority: Minor > > From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): > {code} > Failed tests: > TestRMRestart.testRMRestartGetApplicationList:957 > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597321#comment-14597321 ] Rakesh R commented on YARN-3798: Sorry, I missed your comment. If curator sync up the data it would be fine. Otherwise there could be a chance of lag like we discussed earlier. Truly I haven't tried Curator yet, probably some one can cross check this part. > ZKRMStateStore shouldn't create new session without occurrance of > SESSIONEXPIED > --- > > Key: YARN-3798 > URL: https://issues.apache.org/jira/browse/YARN-3798 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Varun Saxena >Priority: Blocker > Attachments: RM.log, YARN-3798-2.7.002.patch, > YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch > > > RM going down with NoNode exception during create of znode for appattempt > *Please find the exception logs* > {code} > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2015-06-09 10:09:44,886 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 10:09:44,887 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed > out ZK retries. Giving up! > 2015-06-09 10:09:44,887 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error > updating appAttempt: appattempt_1433764310492_7152_01 > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.mult
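For context on the issue title, here is a sketch of the distinction being argued for: only a genuine session expiry should cause the store to build a new ZooKeeper session, while a plain connection loss should be retried on the existing one. The ZooKeeper exception classes are real; everything else (the ZkOp interface and the two retry helpers) is an illustrative placeholder, not the ZKRMStateStore code.
{code}
import org.apache.zookeeper.KeeperException;

public class ZkRetrySketch {
  interface ZkOp { void run() throws KeeperException, InterruptedException; }

  void runWithSessionPolicy(ZkOp op) throws KeeperException, InterruptedException {
    try {
      op.run();
    } catch (KeeperException.SessionExpiredException e) {
      // Only here is the old session really gone: create a new one, then retry.
      recreateSessionAndRetry(op);
    } catch (KeeperException.ConnectionLossException e) {
      // Transient loss: retry on the existing session instead of recreating it.
      retryOnExistingSession(op);
    }
  }

  void recreateSessionAndRetry(ZkOp op) { /* illustrative placeholder */ }
  void retryOnExistingSession(ZkOp op) { /* illustrative placeholder */ }
}
{code}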
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597275#comment-14597275 ] Tsuyoshi Ozawa commented on YARN-3798: -- Result with test-patch.sh against branch-2.7 is as follows: {quote} $ dev-support/test-patch.sh ../YARN-3798-2.7.002.patch ... -1 overall. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated 48 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version ) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. {quote} javadoc warning is not related to the patch since it doesn't change any signatures and javadocs. > ZKRMStateStore shouldn't create new session without occurrance of > SESSIONEXPIED > --- > > Key: YARN-3798 > URL: https://issues.apache.org/jira/browse/YARN-3798 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Varun Saxena >Priority: Blocker > Attachments: RM.log, YARN-3798-2.7.002.patch, > YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch > > > RM going down with NoNode exception during create of znode for appattempt > *Please find the exception logs* > {code} > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2015-06-09 10:09:44,732 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2015-06-09 10:09:44,886 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) > at > org.apache.hado
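For context on what the issue title is asking for, below is a minimal sketch, using the plain ZooKeeper client rather than the real ZKRMStateStore retry helpers, of retrying an operation on connection loss while keeping the existing session, and creating a new session only after ZooKeeper actually reports SESSION_EXPIRED. The connect string, session timeout, and retry count are illustrative assumptions.
{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooKeeper;

public class SessionAwareRetrySketch {
  private static final String CONNECT = "localhost:2181"; // illustrative
  private static final int SESSION_TIMEOUT_MS = 10000;    // illustrative

  private ZooKeeper zk;

  public SessionAwareRetrySketch() throws Exception {
    zk = new ZooKeeper(CONNECT, SESSION_TIMEOUT_MS, event -> { });
  }

  /** Runs a multi op with retries; a new session is created only on SESSION_EXPIRED. */
  public List<OpResult> multiWithRetries(List<Op> ops, int maxRetries) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return zk.multi(ops);
      } catch (KeeperException.ConnectionLossException e) {
        // Transient: the existing session may still be valid on the ensemble,
        // so keep the current handle and simply retry after a short pause.
        if (attempt >= maxRetries) {
          throw e;
        }
        Thread.sleep(1000);
      } catch (KeeperException.SessionExpiredException e) {
        if (attempt >= maxRetries) {
          throw e;
        }
        // Only now is a new session warranted: the old one is permanently gone,
        // so close the stale handle and connect again.
        zk.close();
        zk = new ZooKeeper(CONNECT, SESSION_TIMEOUT_MS, event -> { });
      }
    }
  }
}
{code}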
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597268#comment-14597268 ] Varun Saxena commented on YARN-2902: Sorry, I meant the following. 2. On a heartbeat from the container localizer, if the localizer runner is already stopped, we can indicate to the {color:red}container localizer{color} that it should clean up the resources stuck in the DOWNLOADING state (see the sketch after this message). > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed, then its > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources, they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans, since the scan > never deletes resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
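As a rough illustration of the cleanup idea in the comment above, the following sketch uses hypothetical stand-in types (not the actual NodeManager or ResourceLocalizationService classes) to show a cleanup pass that removes DOWNLOADING entries with a zero reference count once their localizer has stopped, so they do not linger as described in the issue.
{code}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-ins for the NM's localized-resource tracking; illustrative only. */
public class DownloadingCleanupSketch {

  enum ResourceState { DOWNLOADING, LOCALIZED, FAILED }

  static final class TrackedResource {
    final String localPath;
    ResourceState state;
    int refCount;

    TrackedResource(String localPath, ResourceState state, int refCount) {
      this.localPath = localPath;
      this.state = state;
      this.refCount = refCount;
    }
  }

  private final Map<String, TrackedResource> cache = new ConcurrentHashMap<>();

  /**
   * Cleanup pass invoked once the localizer runner for a container has stopped:
   * unreferenced DOWNLOADING entries are deleted instead of lingering forever.
   */
  public void cleanupOrphanedDownloads() {
    Iterator<Map.Entry<String, TrackedResource>> it = cache.entrySet().iterator();
    while (it.hasNext()) {
      TrackedResource r = it.next().getValue();
      if (r.state == ResourceState.DOWNLOADING && r.refCount == 0) {
        deletePartialDownload(r.localPath);
        it.remove();
      }
    }
  }

  private void deletePartialDownload(String localPath) {
    // The real NM would hand this path to its deletion service; here we only log it.
    System.out.println("deleting partial download at " + localPath);
  }
}
{code}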