[jira] [Updated] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated YARN-3885: -- Attachment: YARN-3885.08.patch > ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 > level > -- > > Key: YARN-3885 > URL: https://issues.apache.org/jira/browse/YARN-3885 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Blocker > Attachments: YARN-3885.02.patch, YARN-3885.03.patch, > YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, > YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch > > > when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} > this piece of code, to calculate {{untoucable}}, doesn't consider all the > children; it considers only immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
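The difference the YARN-3885 description points at, looking only at immediate children versus walking every level of the queue tree, can be illustrated with a minimal sketch. Everything here is hypothetical: `QueueNodeSketch` and its `extra` field are made-up names for illustration and are not taken from `ProportionalCapacityPreemptionPolicy` or from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical queue node; "extra" is a made-up per-queue quantity used only for illustration.
class QueueNodeSketch {
    final String name;
    final long extra;
    final List<QueueNodeSketch> children = new ArrayList<>();

    QueueNodeSketch(String name, long extra) {
        this.name = name;
        this.extra = extra;
    }

    // Attaches a child queue and returns this node so calls can be chained.
    QueueNodeSketch add(QueueNodeSketch child) {
        children.add(child);
        return this;
    }

    // Sums "extra" over immediate children only (the kind of behaviour the report describes).
    long sumImmediateChildren() {
        long total = 0;
        for (QueueNodeSketch child : children) {
            total += child.extra;
        }
        return total;
    }

    // Sums "extra" over every descendant by recursing through each level of the tree.
    long sumAllDescendants() {
        long total = 0;
        for (QueueNodeSketch child : children) {
            total += child.extra + child.sumAllDescendants();
        }
        return total;
    }
}
```

With a hierarchy deeper than two levels the two methods diverge, which is the general shape of the symptom in the issue title; whether the real fix recurses in this way is an assumption.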
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348 ] Sunil G commented on YARN-3535: --- Hi [~rohithsharma] and [~peng.zhang] After seeing this patch, I feel there may be a synchronization problem. Please correct me if I am wrong. In the ContainerRescheduledTransition code, it is used like {code} + container.eventHandler.handle(new ContainerRescheduledEvent(container)); + new FinishedTransition().transition(container, event); {code} Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will process {{recoverResourceRequestForContainer}} in a separate thread. Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked and the container will be processed for closure. If the Scheduler dispatcher is slow in processing because of a long pending event queue, there is a chance that recoverResourceRequest may not be correct. I feel we can introduce a new event in {{RMContainerImpl}} that moves it from ALLOCATED to WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire an event back to {{RMContainerImpl}} to indicate that recovery of the resource request is completed. This can move the state forward to KILLED in {{RMContainerImpl}}. Please share your thoughts. > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629369#comment-14629369 ] Peng Zhang commented on YARN-3535: -- bq. there are chances that recoverResourceRequest may not be correct. Sorry, I didn't catch this; maybe I missed something? I think {{recoverResourceRequest}} will not be affected by whether the container finished event is processed faster, because {{recoverResourceRequest}} only processes the ResourceRequest in the container and does not care about its status. > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3805: --- Attachment: YARN-3805.002.patch I rebased the patch. Thanks for pinging me, [~ozawa]. > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629394#comment-14629394 ] Arun Suresh commented on YARN-3535: --- bq. I think recoverResourceRequest will not be affected by whether container finished event is processed faster. Cause recoverResourceRequest only process the ResourceRequest in container and not care its status. I agree with [~peng.zhang] here. IIUC, The {{recoverResourceRequest}} only affects state of the Scheduler and the SchedulerApp. In any case, the fact that the container is killed (the outcome of the {{RMAppAttemptContainerFinishedEvent}} fired by {{FinishedTransition#transition}}) will be notified to the Scheduler.. and that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher. > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
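The ordering argument in the comment above (two events handled by the same dispatcher are processed one after the other, in submission order) can be sketched with a plain single-threaded executor. This is a simplified stand-in under stated assumptions, not YARN's `AsyncDispatcher`; the class name `SingleThreadDispatcherSketch` and the event labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a single-threaded event dispatcher; not YARN's AsyncDispatcher.
class SingleThreadDispatcherSketch {

    // Submits two "events" to one dispatcher thread and returns the order in which they were handled.
    static List<String> dispatchInOrder() {
        List<String> handled = Collections.synchronizedList(new ArrayList<>());
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();

        // First event: deliberately slow, to show that it still completes before the second one starts.
        dispatcher.execute(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handled.add("recoverResourceRequest");
        });

        // Second event: queued behind the first on the same thread.
        dispatcher.execute(() -> handled.add("containerFinished"));

        dispatcher.shutdown();
        try {
            if (!dispatcher.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("dispatcher did not drain in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(handled);
    }
}
```

A single-threaded executor runs tasks sequentially, so the slow first handler always finishes before the second begins. The concern raised earlier in the thread would only apply if the two events went through different dispatcher threads.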
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629410#comment-14629410 ] wangfeng commented on YARN-2809: failed when patching this to hadoop2.6.0,console output: patch -u -p0 < YARN-2809-v3.patch patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java Hunk #1 succeeded at 984 (offset -16 lines). patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java Hunk #1 FAILED at 22. Hunk #2 succeeded at 33 (offset -4 lines). Hunk #3 succeeded at 71 (offset -5 lines). Hunk #4 succeeded at 105 (offset -5 lines). Hunk #5 succeeded at 266 (offset -10 lines). Hunk #6 succeeded at 338 (offset -10 lines). 1 out of 6 hunks FAILED -- saving rejects to file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java.rej patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java > Implement workaround for linux kernel panic when removing cgroup > > > Key: YARN-2809 > URL: https://issues.apache.org/jira/browse/YARN-2809 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 > Environment: RHEL 6.4 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Fix For: 2.7.0 > > Attachments: YARN-2809-v2.patch, YARN-2809-v3.patch, YARN-2809.patch > > > Some older versions of linux have a bug that can cause a kernel panic when > the LCE attempts to remove a cgroup. It is a race condition so it's a bit > rare but on a few thousand node cluster it can result in a couple of panics > per day. 
> This is the commit that likely (haven't verified) fixes the problem in linux: > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267 > Details will be added in comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629411#comment-14629411 ] Sunil G commented on YARN-3535: --- Thank you [~peng.zhang] and [~asuresh] for correcting. bq.that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher Yes. I missed this. Ordering will be corrected here. > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629412#comment-14629412 ] Varun Saxena commented on YARN-3928: Duplicate of MAPREDUCE-6402 > launch application master on specific host > -- > > Key: YARN-3928 > URL: https://issues.apache.org/jira/browse/YARN-3928 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0 > Environment: Ubuntu 12.04 >Reporter: Wenrui > > Hi, > Is there a way to launch the application master on a specific host? > If we cannot do this in a managed-AM-launcher, > is it possible to achieve this in an unmanaged-AM-launcher? > I find it quite necessary to place the application master on a specific host > in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629418#comment-14629418 ] Tsuyoshi Ozawa commented on YARN-3805: -- +1, pending for Jenkins. > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629423#comment-14629423 ] Hadoop QA commented on YARN-3805: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 3m 42s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 2m 59s | Site still builds. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 7m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745590/YARN-3805.002.patch | | Optional Tests | site | | git revision | trunk / 90bda9c | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8556/console | This message was automatically generated. > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629429#comment-14629429 ] Tsuyoshi Ozawa commented on YARN-3805: -- Checking this in. > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
Dongwook Kwon created YARN-3929: --- Summary: Uncleaning option for local app log files with log-aggregation feature Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.6.0, 2.4.0 Reporter: Dongwook Kwon Priority: Minor Although it makes sense to delete local app log files once AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need to keep the local app log files after they are copied to HDFS. Mostly it's for my own backup purposes. I would like to use the log-aggregation feature of YARN and also back up the app log files. Without this option, the files have to be copied from HDFS to local again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629446#comment-14629446 ] Hadoop QA commented on YARN-3885: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 12s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 18s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 61m 19s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 99m 23s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745584/YARN-3885.08.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90bda9c | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8555/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8555/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8555/console | This message was automatically generated. > ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 > level > -- > > Key: YARN-3885 > URL: https://issues.apache.org/jira/browse/YARN-3885 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Blocker > Attachments: YARN-3885.02.patch, YARN-3885.03.patch, > YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, > YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch > > > when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} > this piece of code, to calculate {{untoucable}}, doesn't consider all the > children; it considers only immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629452#comment-14629452 ] zhihai xu commented on YARN-3535: - Also because {{containerCompleted}} and {{pullNewlyAllocatedContainersAndNMTokens}} are synchronized, it will guarantee if AM gets the container, {{ContainerRescheduledEvent}}({{recoverResourceRequestForContainer}}) won't be called. > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
Dian Fu created YARN-3930: - Summary: FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Reporter: Dian Fu Assignee: Dian Fu When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS, for some reason, throws an exception when {{ensureAppendEditlogFile}} is called, and as a result the edit log output stream isn't closed. The next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we are already the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
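The failure mode described in YARN-3930 (an exception leaves the output stream open, so the next append fails) is commonly avoided by closing the stream in a `finally` block. The following is a minimal sketch of that general pattern only; `EditLogCloseSketch`, `TrackingStream`, and the simulated failure are hypothetical and do not reflect the actual `FileSystemNodeLabelsStore` code or the attached patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration of "close the stream even when the write fails".
class EditLogCloseSketch {

    // Test double that records whether close() was called.
    static class TrackingStream extends ByteArrayOutputStream {
        boolean closed = false;

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Writes one record; the finally block closes the stream on both the success and failure paths.
    static void appendRecord(OutputStream editLog, byte[] record, boolean failWrite) throws IOException {
        try {
            if (failWrite) {
                throw new IOException("simulated write failure");
            }
            editLog.write(record);
        } finally {
            editLog.close();
        }
    }

    // Returns true if the stream ends up closed after a failed append.
    static boolean closedAfterFailure() {
        TrackingStream stream = new TrackingStream();
        try {
            appendRecord(stream, new byte[] {1, 2, 3}, true);
        } catch (IOException expected) {
            // The failure is expected here; what matters is the stream state afterwards.
        }
        return stream.closed;
    }
}
```

Because the stream is released on the failure path as well, a later append would not run into a handle that is still open. Whether the real fix closes in `finally` or in a `catch` block is not stated in the message.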
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629464#comment-14629464 ] Hudson commented on YARN-3805: -- FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8173/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629465#comment-14629465 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8173/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated YARN-3930: -- Attachment: YARN-3930.001.patch A simple patch attached. > FileSystemNodeLabelsStore should make sure edit log file closed when > exception is thrown > - > > Key: YARN-3930 > URL: https://issues.apache.org/jira/browse/YARN-3930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3930.001.patch > > > When I test the node label feature in my local environment, I encountered the > following exception: > {code} > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > {code} > The reason is that HDFS, for some reason, throws an exception when > {{ensureAppendEditlogFile}} is called, and as a result the edit log > output stream isn't closed. The next time we call > {{ensureAppendEditlogFile}}, lease recovery fails because we are already > the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
[ https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongwook Kwon updated YARN-3929: Attachment: YARN-3929.01.patch Could you review this patch? Thanks. > Uncleaning option for local app log files with log-aggregation feature > -- > > Key: YARN-3929 > URL: https://issues.apache.org/jira/browse/YARN-3929 > Project: Hadoop YARN > Issue Type: New Feature > Components: log-aggregation >Affects Versions: 2.4.0, 2.6.0 >Reporter: Dongwook Kwon >Priority: Minor > Attachments: YARN-3929.01.patch > > > Although it makes sense to delete local app log files once AppLogAggregator > has copied all files to the remote location (HDFS), I have some use cases that need > to keep the local app log files after they are copied to HDFS. Mostly it's for my own > backup purposes. I would like to use the log-aggregation feature of YARN and also > back up the app log files. Without this option, the files have to be copied from > HDFS to local again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
kyungwan nam created YARN-3931: -- Summary: default-node-label-expression doesn’t apply when an application is submitted by RM rest api Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam * yarn.scheduler.capacity..default-node-label-expression=large_disk * submit an application using rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629489#comment-14629489 ] kyungwan nam commented on YARN-3931: node-label-expression is initialized to an empty string in {{ApplicationSubmissionContextInfo}}: {code} ... public ApplicationSubmissionContextInfo() { applicationId = ""; applicationName = ""; containerInfo = new ContainerLaunchContextInfo(); resource = new ResourceInfo(); priority = Priority.UNDEFINED.getPriority(); isUnmanagedAM = false; cancelTokensWhenComplete = true; keepContainers = false; applicationType = ""; tags = new HashSet(); appNodeLabelExpression = ""; amContainerNodeLabelExpression = ""; } {code} but the check on the resource request only tests whether node-label-expression is null, so the queue's default label expression is never applied: {code} // check labels in the resource request. String labelExp = resReq.getNodeLabelExpression(); // if queue has default label expression, and RR doesn't have, use the // default label expression of queue if (labelExp == null && queueInfo != null) { labelExp = queueInfo.getDefaultNodeLabelExpression(); resReq.setNodeLabelExpression(labelExp); } {code} > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: Naganarasimha G R > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression", > "am-container-node-label-expression" > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
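The mismatch described in the YARN-3931 comment above (the field defaults to `""`, while the fallback fires only on `null`) can be reproduced in isolation. The sketch below is hypothetical: `NodeLabelDefaultSketch` and both method names are invented for illustration, and treating an empty expression as "unset" is an assumption; the real fix may instead leave the REST field null.

```java
// Hypothetical illustration of the null-versus-empty mismatch; not YARN code.
class NodeLabelDefaultSketch {

    // Mirrors the shape of the quoted check: falls back to the queue default only when the request is null.
    static String nullOnlyFallback(String requested, String queueDefault) {
        if (requested == null) {
            return queueDefault;
        }
        return requested;
    }

    // Assumed alternative: also treat an empty expression as "not specified".
    static String nullOrEmptyFallback(String requested, String queueDefault) {
        if (requested == null || requested.isEmpty()) {
            return queueDefault;
        }
        return requested;
    }
}
```

With a request expression of `""` and a queue default of `large_disk`, the first method returns the empty string and the second returns `large_disk`, which matches the symptom reported for submissions through the REST API.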
[jira] [Assigned] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-3931: --- Assignee: Naganarasimha G R > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: Naganarasimha G R > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression”, > "am-container-node-label-expression” > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629491#comment-14629491 ] Naganarasimha G R commented on YARN-3931: - Hi [~kyungwan nam], thanks for raising the issue. I have assigned this JIRA to myself, but if you would like to look into it further and fix it, please reassign it. > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: Naganarasimha G R > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression”, > "am-container-node-label-expression” > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629555#comment-14629555 ] Ajith S commented on YARN-3885: --- not because of the patch > ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 > level > -- > > Key: YARN-3885 > URL: https://issues.apache.org/jira/browse/YARN-3885 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Blocker > Attachments: YARN-3885.02.patch, YARN-3885.03.patch, > YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, > YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch > > > when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}} > this piece of code, to calculate {{untoucable}} doesn't consider all the > children, it considers only immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
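The quoted description contrasts a calculation over immediate children with one over all children in a queue hierarchy deeper than two levels. As a generic illustration only, the sketch below uses a hypothetical `QueueNode` tree with a made-up per-leaf amount `extra`; it is not the preemption policy's actual data model or the patch's approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical queue tree; "extra" is an illustrative per-leaf amount.
final class QueueNode {
    final String name;
    final long extra;
    final List<QueueNode> children = new ArrayList<>();

    QueueNode(String name, long extra) {
        this.name = name;
        this.extra = extra;
    }

    QueueNode add(QueueNode child) {
        children.add(child);
        return this;
    }

    /** Looks one level down only, so amounts on deeper leaves are missed. */
    long sumImmediateChildren() {
        long total = 0;
        for (QueueNode child : children) {
            total += child.extra;
        }
        return total;
    }

    /** Recurses through every level and sums the amounts on all leaves. */
    long sumAllDescendants() {
        if (children.isEmpty()) {
            return extra;
        }
        long total = 0;
        for (QueueNode child : children) {
            total += child.sumAllDescendants();
        }
        return total;
    }

    public static void main(String[] args) {
        // root -> a -> {a1, a2}, root -> b : three levels below the root's parent view
        QueueNode root = new QueueNode("root", 0)
                .add(new QueueNode("a", 0)
                        .add(new QueueNode("a1", 5))
                        .add(new QueueNode("a2", 7)))
                .add(new QueueNode("b", 3));
        System.out.println("immediate children only: " + root.sumImmediateChildren());
        System.out.println("all descendants:         " + root.sumAllDescendants());
    }
}
```

With a two-level hierarchy both methods agree, which would be consistent with the problem only surfacing once queues are nested more deeply.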
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629616#comment-14629616 ] Hudson commented on YARN-3174: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-project/src/site/site.xml > Consolidate the NodeManager and NodeManagerRestart documentation into one > - > > Key: YARN-3174 > URL: https://issues.apache.org/jira/browse/YARN-3174 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Affects Versions: 2.7.1 >Reporter: Allen Wittenauer >Assignee: Masatake Iwasaki > Fix For: 2.8.0 > > Attachments: YARN-3174.001.patch > > > We really don't need a different document for every individual nodemanager > feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629618#comment-14629618 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
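The improvement described in the YARN-90 summary (a disk marked as failed should become usable again once it recovers, without a restart) amounts to re-probing failed directories on every check instead of only the healthy ones. The sketch below is a simplified illustration under that assumption; `DirHealthTracker` and its `probe` predicate are hypothetical and do not mirror the NodeManager's real disk-checking classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical tracker; the probe stands in for whatever health check is used.
final class DirHealthTracker {
    private final Set<String> good = new LinkedHashSet<>();
    private final Set<String> failed = new LinkedHashSet<>();
    private final Predicate<String> probe;

    DirHealthTracker(List<String> dirs, Predicate<String> probe) {
        this.good.addAll(dirs);
        this.probe = probe;
    }

    /** Re-probes every known directory, including ones that failed earlier. */
    void checkAll() {
        List<String> all = new ArrayList<>(good);
        all.addAll(failed);
        for (String dir : all) {
            if (probe.test(dir)) {
                failed.remove(dir); // a recovered directory becomes usable again
                good.add(dir);
            } else {
                good.remove(dir);
                failed.add(dir);
            }
        }
    }

    List<String> goodDirs() {
        return new ArrayList<>(good);
    }

    List<String> failedDirs() {
        return new ArrayList<>(failed);
    }
}
```

The design point is only that the failed set is an input to the next check rather than a terminal state.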
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629615#comment-14629615 ] Hudson commented on YARN-3805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629623#comment-14629623 ] Hudson commented on YARN-3174: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-project/src/site/site.xml * hadoop-yarn-project/CHANGES.txt > Consolidate the NodeManager and NodeManagerRestart documentation into one > - > > Key: YARN-3174 > URL: https://issues.apache.org/jira/browse/YARN-3174 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Affects Versions: 2.7.1 >Reporter: Allen Wittenauer >Assignee: Masatake Iwasaki > Fix For: 2.8.0 > > Attachments: YARN-3174.001.patch > > > We really don't need a different document for every individual nodemanager > feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629625#comment-14629625 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629622#comment-14629622 ] Hudson commented on YARN-3805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629646#comment-14629646 ] Hadoop QA commented on YARN-3930: -
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 8s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 2 new checkstyle issues (total was 14, now 15). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 19s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 34s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. |
| | | 40m 1s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745596/YARN-3930.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8557/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8557/console |
This message was automatically generated.
> FileSystemNodeLabelsStore should make sure edit log file closed when > exception is thrown > - > > Key: YARN-3930 > URL: https://issues.apache.org/jira/browse/YARN-3930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3930.001.patch > > > When I test the node label feature in my local environment, I encountered the > following exception: > {code} > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabels
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629660#comment-14629660 ] Varun Saxena commented on YARN-3877: [~chris.douglas], thanks for the review. Yes, you are correct that this config is not required for test. Will remove it. Will move the relevant test code to a separate test. > YarnClientImpl.submitApplication swallows exceptions > > > Key: YARN-3877 > URL: https://issues.apache.org/jira/browse/YARN-3877 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.7.2 >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Minor > Attachments: YARN-3877.01.patch > > > When {{YarnClientImpl.submitApplication}} spins waiting for the application > to be accepted, any interruption during its Sleep() calls are logged and > swallowed. > this makes it hard to interrupt the thread during shutdown. Really it should > throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
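The behaviour requested in the quoted issue description (let an interruption during the wait reach the caller instead of logging and swallowing it) can be sketched as a plain polling loop. `AcceptancePoller` and `waitUntilAccepted` are hypothetical names for illustration; this is not the code from the attached patch, and the real method may wrap the interruption in a different exception type.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper, not YarnClientImpl itself.
final class AcceptancePoller {

    /** Polls until accepted; an interrupt ends the wait instead of being swallowed. */
    static void waitUntilAccepted(BooleanSupplier accepted, long pollMillis)
            throws InterruptedException {
        while (!accepted.getAsBoolean()) {
            // Thread.sleep throws InterruptedException when the thread is
            // interrupted; letting it propagate leaves the decision to the caller.
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        Thread waiter = new Thread(() -> {
            try {
                waitUntilAccepted(() -> false, 50); // never accepted in this demo
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag for callers
                System.out.println("wait aborted by interrupt");
            }
        });
        waiter.start();
        waiter.interrupt();
        waiter.join();
    }
}
```

A caller that cannot declare `InterruptedException` would typically restore the interrupt flag, as the demo thread does, before returning or throwing its own exception.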
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629701#comment-14629701 ] kyungwan nam commented on YARN-3931: Hi, I couldn't reassign it to myself. I think I don't have the privilege to assign issues. > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: Naganarasimha G R > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression”, > "am-container-node-label-expression” > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3877: --- Attachment: YARN-3877.02.patch > YarnClientImpl.submitApplication swallows exceptions > > > Key: YARN-3877 > URL: https://issues.apache.org/jira/browse/YARN-3877 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.7.2 >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Minor > Attachments: YARN-3877.01.patch, YARN-3877.02.patch > > > When {{YarnClientImpl.submitApplication}} spins waiting for the application > to be accepted, any interruption during its Sleep() calls are logged and > swallowed. > this makes it hard to interrupt the thread during shutdown. Really it should > throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629714#comment-14629714 ] Varun Saxena commented on YARN-3877: [~chris.douglas], I have uploaded a new patch. Kindly review. To avoid timing issues in the test, I added code that waits for the thread to enter sleep (the TIMED_WAITING state) before the call to interrupt. > YarnClientImpl.submitApplication swallows exceptions > > > Key: YARN-3877 > URL: https://issues.apache.org/jira/browse/YARN-3877 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.7.2 >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Minor > Attachments: YARN-3877.01.patch, YARN-3877.02.patch > > > When {{YarnClientImpl.submitApplication}} spins waiting for the application > to be accepted, any interruption during its Sleep() calls are logged and > swallowed. > this makes it hard to interrupt the thread during shutdown. Really it should > throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
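The test technique mentioned in this comment, waiting until the target thread is in the TIMED_WAITING state before interrupting it, might be sketched as follows. `InterruptWhenSleeping` and `awaitTimedWaiting` are hypothetical helpers, not the code in YARN-3877.02.patch. Note that TIMED_WAITING also covers other timed waits (for example a timed `join`), so the check only shows that the thread is blocked with a timeout, not specifically that it is inside `Thread.sleep`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical test helper; names are illustrative only.
final class InterruptWhenSleeping {

    /** Polls until the thread reports TIMED_WAITING, it terminates, or the timeout passes. */
    static boolean awaitTimedWaiting(Thread t, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (t.getState() != Thread.State.TIMED_WAITING) {
            if (t.getState() == Thread.State.TERMINATED
                    || System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(5);
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean interrupted = new AtomicBoolean(false);
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                interrupted.set(true);
            }
        });
        sleeper.start();
        boolean sleeping = awaitTimedWaiting(sleeper, 5_000);
        sleeper.interrupt(); // delivered only after the thread was seen waiting
        sleeper.join();
        System.out.println("reached sleep: " + sleeping
                + ", interrupted: " + interrupted.get());
    }
}
```

Polling the thread state avoids a fixed delay before the interrupt, which is the usual source of flakiness in this kind of test.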
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629722#comment-14629722 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/CHANGES.txt * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md > Consolidate the NodeManager and NodeManagerRestart documentation into one > - > > Key: YARN-3174 > URL: https://issues.apache.org/jira/browse/YARN-3174 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Affects Versions: 2.7.1 >Reporter: Allen Wittenauer >Assignee: Masatake Iwasaki > Fix For: 2.8.0 > > Attachments: YARN-3174.001.patch > > > We really don't need a different document for every individual nodemanager > feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629728#comment-14629728 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629724#comment-14629724 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629716#comment-14629716 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629715#comment-14629715 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629719#comment-14629719 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md > Consolidate the NodeManager and NodeManagerRestart documentation into one > - > > Key: YARN-3174 > URL: https://issues.apache.org/jira/browse/YARN-3174 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Affects Versions: 2.7.1 >Reporter: Allen Wittenauer >Assignee: Masatake Iwasaki > Fix For: 2.8.0 > > Attachments: YARN-3174.001.patch > > > We really don't need a different document for every individual nodemanager > feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629718#comment-14629718 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > Update the documentation of Disk Checker based on YARN-90 > - > > Key: YARN-3805 > URL: https://issues.apache.org/jira/browse/YARN-3805 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-3805.001.patch, YARN-3805.002.patch > > > NodeManager is able to recover status of the disk once broken and fixed > without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629723#comment-14629723 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt > NodeManager should identify failed disks becoming good again > > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, > apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, > apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629717#comment-14629717 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md > Consolidate the NodeManager and NodeManagerRestart documentation into one > - > > Key: YARN-3174 > URL: https://issues.apache.org/jira/browse/YARN-3174 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Affects Versions: 2.7.1 >Reporter: Allen Wittenauer >Assignee: Masatake Iwasaki > Fix For: 2.8.0 > > Attachments: YARN-3174.001.patch > > > We really don't need a different document for every individual nodemanager > feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629750#comment-14629750 ] Lei Guo commented on YARN-3928: --- [~varun_saxena], I read this JIRA as a host preference requirement during container allocation, it's not a duplicate of MAPREDUCE-6402. [~wenrui], can you confirm? > launch application master on specific host > -- > > Key: YARN-3928 > URL: https://issues.apache.org/jira/browse/YARN-3928 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0 > Environment: Ubuntu 12.04 >Reporter: Wenrui > > Hi, > Is there a way to launch application master on a specific host ? > If we can not do this in a managed-AM-launcher? > then is it possible to achieve this in unmanaged-AM-launcher? > I just find it's quite necessary to set application master on a specific host > in some scenes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629753#comment-14629753 ] Varun Saxena commented on YARN-3928: Oh, then it is not. Misread the JIRA title. Apologies. > launch application master on specific host > -- > > Key: YARN-3928 > URL: https://issues.apache.org/jira/browse/YARN-3928 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0 > Environment: Ubuntu 12.04 >Reporter: Wenrui > > Hi, > Is there a way to launch application master on a specific host ? > If we can not do this in a managed-AM-launcher? > then is it possible to achieve this in unmanaged-AM-launcher? > I just find it's quite necessary to set application master on a specific host > in some scenes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629790#comment-14629790 ] Naganarasimha G R commented on YARN-3931: - [~kyungwan nam], good that you are trying to contribute :). We need to ask a committer to add you to the list of contributors, but in the meantime you can upload the patch with a test case and I can help review it. [~wangda tan], can you please add [~kyungwan nam] to the contributor list and assign this JIRA to him? > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: Naganarasimha G R > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression”, > "am-container-node-label-expression” > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629811#comment-14629811 ] Hadoop QA commented on YARN-3877: -
\\ \\
| (/) *{color:green}+1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 34s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 28s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 53s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 6m 55s | Tests passed in hadoop-yarn-client. |
| | | 43m 31s | |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745625/YARN-3877.02.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8558/artifact/patchprocess/testrun_hadoop-yarn-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8558/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8558/console |
This message was automatically generated. > YarnClientImpl.submitApplication swallows exceptions > > > Key: YARN-3877 > URL: https://issues.apache.org/jira/browse/YARN-3877 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.7.2 >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Minor > Attachments: YARN-3877.01.patch, YARN-3877.02.patch > > > When {{YarnClientImpl.submitApplication}} spins waiting for the application > to be accepted, any interruption during its Sleep() calls are logged and > swallowed. > this makes it hard to interrupt the thread during shutdown. Really it should > throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
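The behaviour requested in YARN-3877 can be sketched roughly as follows. This is a minimal illustration and not the attached patch: `SubmitPoller`, `waitUntilAccepted`, and the `BooleanSupplier` check are hypothetical names, and the choice of `InterruptedIOException` is an assumption about what "some form of exception" could look like.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of YarnClientImpl.
final class SubmitPoller {
    /**
     * Polls until accepted.getAsBoolean() returns true.
     * An interrupt during the sleep is not swallowed: the interrupt flag is
     * restored and the interruption is surfaced to the caller.
     */
    static void waitUntilAccepted(BooleanSupplier accepted, long pollMillis)
            throws IOException {
        while (!accepted.getAsBoolean()) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the flag so callers further up can still observe it.
                Thread.currentThread().interrupt();
                InterruptedIOException wrapped = new InterruptedIOException(
                        "Interrupted while waiting for the application to be accepted");
                wrapped.initCause(e);
                throw wrapped;
            }
        }
    }
}
```

With this shape a caller that interrupts the submitting thread during shutdown sees an exception promptly instead of a log line followed by more polling.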
[jira] [Updated] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3784: -- Attachment: 0002-YARN-3784.patch Uploading a new version of the patch. Initially the RM sent only a list of container IDs in the preemption message. This patch now also includes a timeout along with the container IDs. The new timeout is an optional parameter in the proto. [~chris.douglas], could you please take a look? > Indicate preemption timout along with the list of containers to AM > (preemption message) > --- > > Key: YARN-3784 > URL: https://issues.apache.org/jira/browse/YARN-3784 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch > > > Currently during preemption, the AM is notified with a list of containers which > are marked for preemption. This JIRA introduces a timeout duration along with > this container list so that the AM knows how much time it has for a > graceful shutdown of its containers (assuming a preemption policy is > loaded in the AM). > This will also help in NM decommissioning scenarios, where the NM is > decommissioned after a timeout (also killing the containers on it). The timeout > tells the AM that the RM can forcefully kill those containers > after the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629940#comment-14629940 ] Ming Ma commented on YARN-2578: --- Thanks [~iwasakims]. Is it similar to HADOOP-11252? Given your latest patch is in hadoop-common, it might be better to fix it as a HADOOP jira instead. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg > Attachments: YARN-2578.002.patch, YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected, as expected. The NM should then re-register > with the new active RM. This re-registration takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services and make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DNs fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
The main thread > which should be used for the re-registration is the "Node Status Updater". This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
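To illustrate why an unbounded wait like the one in the trace above is a problem, here is a generic sketch of bounding a blocking call with a timeout. It is not how Hadoop IPC implements RPC timeouts and it is not the YARN-2578 patch; `BoundedCall` and `callWithTimeout` are hypothetical names, and the sketch only assumes that the caller would rather get a `TimeoutException` than block forever.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, unrelated to org.apache.hadoop.ipc.Client.
final class BoundedCall {
    /** Runs the call on a helper thread and waits at most timeoutMillis for its result. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // do not keep the JVM alive for a stuck call
            return t;
        });
        try {
            Future<T> future = executor.submit(call);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the stuck call, then report the timeout
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A heartbeat loop built this way could treat the timeout as a signal to fail over to the other RM instead of waiting on the dead connection.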
[jira] [Created] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
Bibin A Chundatt created YARN-3932: -- Summary: SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt The application resource report is shown incorrectly when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used. Allocated CPU VCores and Allocated Memory MB are always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} The call should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
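The symptom reported above can be reproduced with a simplified model. The class below is a hypothetical stand-in, not the real YARN `ResourceUsage`; it assumes usage is tracked per node label and that the no-argument getter reads only the empty (default) label, which would explain a report of zero for an application running on a labelled partition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of per-label usage tracking.
final class LabelUsage {
    static final String NO_LABEL = "";
    private final Map<String, Long> usedMemoryMbByLabel = new HashMap<>();

    void incUsed(String label, long memoryMb) {
        usedMemoryMbByLabel.merge(label, memoryMb, Long::sum);
    }

    /** Reads only the default (empty) label, mirroring the reported behaviour. */
    long getUsed() {
        return getUsed(NO_LABEL);
    }

    long getUsed(String label) {
        return usedMemoryMbByLabel.getOrDefault(label, 0L);
    }

    /** Alternative for a report: sum the usage over every label. */
    long getUsedAcrossLabels() {
        long total = 0L;
        for (long used : usedMemoryMbByLabel.values()) {
            total += used;
        }
        return total;
    }
}
```

In this model an application that only uses a hypothetical label `labelA` reports zero from `getUsed()` but the expected amount from `getUsed("labelA")` or `getUsedAcrossLabels()`.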
[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3932: --- Attachment: ApplicationReport.jpg > SchedulerApplicationAttempt#getResourceUsageReport should be based on > NodeLabel > --- > > Key: YARN-3932 > URL: https://issues.apache.org/jira/browse/YARN-3932 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: ApplicationReport.jpg > > > The application resource report is shown incorrectly when a node label is used. > 1. Submit an application with a NodeLabel. > 2. Check the RM UI for resources used. > Allocated CPU VCores and Allocated Memory MB are always {{zero}} > {code} > public synchronized ApplicationResourceUsageReport getResourceUsageReport() { > AggregateAppResourceUsage runningResourceUsage = > getRunningAggregateAppResourceUsage(); > Resource usedResourceClone = > Resources.clone(attemptResourceUsage.getUsed()); > Resource reservedResourceClone = > Resources.clone(attemptResourceUsage.getReserved()); > return ApplicationResourceUsageReport.newInstance(liveContainers.size(), > reservedContainers.size(), usedResourceClone, reservedResourceClone, > Resources.add(usedResourceClone, reservedResourceClone), > runningResourceUsage.getMemorySeconds(), > runningResourceUsage.getVcoreSeconds()); > } > {code} > The call should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1644) RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1644: Attachment: YARN-1644-YARN-1197.4.patch Updated this patch as dependent patch has been updated. > RM-NM protocol changes and NodeStatusUpdater implementation to support > container resizing > - > > Key: YARN-1644 > URL: https://issues.apache.org/jira/browse/YARN-1644 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Wangda Tan >Assignee: MENG DING > Attachments: YARN-1644-YARN-1197.4.patch, YARN-1644.1.patch, > YARN-1644.2.patch, YARN-1644.3.patch, yarn-1644.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
[ https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629998#comment-14629998 ] Xuan Gong commented on YARN-3929: - [~dongwook], does the configuration yarn.nodemanager.delete.debug-delay-sec satisfy your requirement? > Uncleaning option for local app log files with log-aggregation feature > -- > > Key: YARN-3929 > URL: https://issues.apache.org/jira/browse/YARN-3929 > Project: Hadoop YARN > Issue Type: New Feature > Components: log-aggregation >Affects Versions: 2.4.0, 2.6.0 >Reporter: Dongwook Kwon >Priority: Minor > Attachments: YARN-3929.01.patch > > > Although it makes sense to delete local app log files once AppLogAggregator > has copied all files to the remote location (HDFS), I have some use cases that need > to keep the local app log files after they are copied to HDFS, mostly for my own > backup purposes. I would like to use the log-aggregation feature of YARN and > back up app log files too. Without this option, files have to be copied from > HDFS to local again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3893: Issue Type: Sub-task (was: Bug) Parent: YARN-149 > Both RM in active state when Admin#transitionToActive failure from refeshAll() > -- > > Key: YARN-3893 > URL: https://issues.apache.org/jira/browse/YARN-3893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, > 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml > > > Cases that can cause this. > # Capacity scheduler xml is wrongly configured during switch > # Refresh ACL failure due to configuration > # Refresh User group failure due to configuration > Continuously both RM will try to be active > {code} > dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> > ./yarn rmadmin -getServiceState rm1 > 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > active > dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> > ./yarn rmadmin -getServiceState rm2 > 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > active > {code} > # Both Web UI active > # Status shown as active for both RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3931: - Assignee: kyungwan nam (was: Naganarasimha G R) > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: kyungwan nam > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression”, > "am-container-node-label-expression” > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630010#comment-14630010 ] Wangda Tan commented on YARN-3931: -- Thanks for raising the issue [~kyungwan nam], I just added you to contributor list and assigned the JIRA to you. > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: kyungwan nam > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using rest api without "app-node-label-expression”, > "am-container-node-label-expression” > * RM doesn’t allocate containers to the hosts associated with large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630017#comment-14630017 ] Wangda Tan commented on YARN-3930: -- [~dian.fu], Thanks for working on the JIRA. Patch looks good, will commit soon. > FileSystemNodeLabelsStore should make sure edit log file closed when > exception is thrown > - > > Key: YARN-3930 > URL: https://issues.apache.org/jira/browse/YARN-3930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3930.001.patch > > > When I test the node label feature in my local environment, I encountered the > following exception: > {code} > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > {code} > The reason is that HDFS, for some reason, throws an exception when > {{ensureAppendEditlogFile}} is called, and as a result the edit log > output stream isn't closed. The next time we call > {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are > the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
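The general pattern behind the fix described above, closing the output stream even when a write fails, can be sketched as follows. This is not the YARN-3930 patch and does not use the HDFS API: `EditLogWriter`, `Sink`, and `SinkOpener` are hypothetical names, and opening and closing the stream around every append is a simplification made for the example.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch; the real store works against an HDFS output stream.
final class EditLogWriter {
    /** Stand-in for an append stream on the edit log file. */
    interface Sink extends Closeable {
        void write(String record) throws IOException;
    }

    /** Stand-in for whatever opens the edit log for append. */
    interface SinkOpener {
        Sink openForAppend() throws IOException;
    }

    /**
     * Appends one record. try-with-resources guarantees close() runs even if
     * write() throws, so a later append does not find the stream still open.
     */
    static void appendRecord(SinkOpener opener, String record) throws IOException {
        try (Sink sink = opener.openForAppend()) {
            sink.write(record);
        }
    }
}
```

If the write fails, the exception still propagates to the caller, but the stream has been closed first, which is the property the issue title asks for.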
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630022#comment-14630022 ] Wangda Tan commented on YARN-3885: -- Patch LGTM, +1, will commit soon. Thanks [~ajithshetty]. > ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 > level > -- > > Key: YARN-3885 > URL: https://issues.apache.org/jira/browse/YARN-3885 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Blocker > Attachments: YARN-3885.02.patch, YARN-3885.03.patch, > YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, > YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch > > > When the preemption policy is used, the piece of code in {{ProportionalCapacityPreemptionPolicy.cloneQueues}} > that calculates {{untoucable}} doesn't consider all the > children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630042#comment-14630042 ] Wangda Tan commented on YARN-2003: -- Thanks [~sunilg] for the update. A few more comments regarding the latest patch: - I suggest deferring the queue check. We're currently changing how queue mapping is done. Ideally, it should happen before submitting to the scheduler (maybe before assigning the application priority), see YARN-3635. - The assumption that the queue exists before submitting to the scheduler may not always be valid. With queue mapping, the scheduler can create a queue when accepting an application. I suggest removing the check for the queue's existence. Instead, you can have a private method to get the priority by queue name. If the queue does not exist, you can assign the default priority to the application. - Comparison of priorities should use Priority.compareTo instead of ">/<". > Support for Application priority : Changes in RM and Capacity Scheduler > --- > > Key: YARN-2003 > URL: https://issues.apache.org/jira/browse/YARN-2003 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, > 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, > 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, > 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, > 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, > 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, > 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, > 0021-YARN-2003.patch, 0022-YARN-2003.patch > > > AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from > Submission Context and store it. > Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
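The last two review suggestions above (look up the priority by queue name with a default fallback, and compare through `compareTo`) could look roughly like this. The sketch uses a hypothetical `AppPriority` class rather than the real YARN `Priority`, and it assumes a larger integer means a higher priority; the ordering of the real class should be checked rather than assumed, which is one reason to go through `compareTo` in the first place.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not code from the YARN-2003 patch.
final class QueuePriorities {
    /** Stand-in for a priority value object; larger value = higher priority (assumed). */
    static final class AppPriority implements Comparable<AppPriority> {
        final int value;

        AppPriority(int value) {
            this.value = value;
        }

        @Override
        public int compareTo(AppPriority other) {
            return Integer.compare(value, other.value);
        }
    }

    private final Map<String, AppPriority> maxPriorityByQueue = new HashMap<>();
    private final AppPriority defaultPriority;

    QueuePriorities(AppPriority defaultPriority) {
        this.defaultPriority = defaultPriority;
    }

    void setMaxPriority(String queueName, AppPriority max) {
        maxPriorityByQueue.put(queueName, max);
    }

    /** No existence check: an unknown queue simply gets the default priority. */
    AppPriority maxPriorityFor(String queueName) {
        return maxPriorityByQueue.getOrDefault(queueName, defaultPriority);
    }

    /** Caps the requested priority at the queue maximum, comparing via compareTo. */
    AppPriority clamp(String queueName, AppPriority requested) {
        AppPriority max = maxPriorityFor(queueName);
        return requested.compareTo(max) > 0 ? max : requested;
    }
}
```

Because the lookup never fails for a missing queue, it keeps working when the scheduler creates the queue only after the application has been accepted.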
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630090#comment-14630090 ] Bibin A Chundatt commented on YARN-3932: Hi [~leftnoteasy], I think we should iterate over {{liveContainers}} and sum the resources used. Any thoughts? > SchedulerApplicationAttempt#getResourceUsageReport should be based on > NodeLabel > --- > > Key: YARN-3932 > URL: https://issues.apache.org/jira/browse/YARN-3932 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: ApplicationReport.jpg > > > The application resource report is shown incorrectly when a node label is used. > 1. Submit an application with a NodeLabel. > 2. Check the RM UI for resources used. > Allocated CPU VCores and Allocated Memory MB are always {{zero}} > {code} > public synchronized ApplicationResourceUsageReport getResourceUsageReport() { > AggregateAppResourceUsage runningResourceUsage = > getRunningAggregateAppResourceUsage(); > Resource usedResourceClone = > Resources.clone(attemptResourceUsage.getUsed()); > Resource reservedResourceClone = > Resources.clone(attemptResourceUsage.getReserved()); > return ApplicationResourceUsageReport.newInstance(liveContainers.size(), > reservedContainers.size(), usedResourceClone, reservedResourceClone, > Resources.add(usedResourceClone, reservedResourceClone), > runningResourceUsage.getMemorySeconds(), > runningResourceUsage.getVcoreSeconds()); > } > {code} > The call should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3914) Entity created time should be part of the row key of entity table
[ https://issues.apache.org/jira/browse/YARN-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630125#comment-14630125 ] Sangjin Lee commented on YARN-3914: --- [~zjshen], we have been discussing this. While adding entity creation time to the row key may solve this problem, the concern is that it may introduce others. If the row key is (user/cluster/flow/run/app_id/entity_type/created_time/entity_id), then even the most basic query for (entity_type + entity_id) will get much more complicated, right? We cannot expect readers to provide the creation time every time they query for an entity by id. Also, as you said, we cannot always accommodate different query vectors by adding more to the row key, or we would risk blowing up the row key size or breaking other queries. We should be really judicious about what goes into the row key... I think it's reasonable to expect that the entity id order would be either completely or nearly identical to the chronological order (e.g. app id, or container id). So perhaps we could rely on the entity id order to help mitigate this problem. Thoughts? > Entity created time should be part of the row key of entity table > - > > Key: YARN-3914 > URL: https://issues.apache.org/jira/browse/YARN-3914 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Entity created time should be part of the row key of entity table, between > entity type and entity Id. The reason to have it is to index the entities. > Though we cannot index the entities for all kinds of information, indexing > them according to the created time is very necessary. Without it, every query > for the latest entities that belong to an application and a type will scan > through all the entities that belong to them. For example, listing > the 100 most recently started containers in a YARN app requires such a scan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
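The trade-off discussed in this thread can be illustrated with a sorted map standing in for the entity table. Everything here is a hypothetical sketch: the `!` separator, the inverted zero-padded timestamp, and the `EntityRowKeys` helper are assumptions made for the example (entity types are assumed not to contain `!`), and the real row key also carries user, cluster, flow, run, and app id, which are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; a TreeMap stands in for a table sorted by row key.
final class EntityRowKeys {
    /** Key layout: type ! (Long.MAX_VALUE - createdTime, zero padded) ! id, so newer rows sort first. */
    static String key(String entityType, long createdTime, String entityId) {
        return String.format(Locale.ROOT, "%s!%019d!%s",
                entityType, Long.MAX_VALUE - createdTime, entityId);
    }

    /** Latest entities of a type: a short prefix scan that stops after limit rows. */
    static List<String> latest(TreeMap<String, String> table, String entityType, int limit) {
        String prefix = entityType + "!";
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix) || ids.size() == limit) {
                break;
            }
            ids.add(e.getValue());
        }
        return ids;
    }

    /** Lookup by id without the created time: has to scan every row of the type. */
    static String findById(TreeMap<String, String> table, String entityType, String entityId) {
        String prefix = entityType + "!";
        String suffix = "!" + entityId;
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            if (e.getKey().endsWith(suffix)) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

`latest` shows the benefit argued for in the issue description, while `findById` shows the cost raised in the comment: once the created time sits in front of the id, a reader who knows only the id can no longer do a point lookup.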
[jira] [Commented] (YARN-3635) Get-queue-mapping should be a common interface of YarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630139#comment-14630139 ] Wangda Tan commented on YARN-3635: -- Hi [~sandyr], Thanks for your comments. I actually read QueuePlacementPolicy/QueuePlacementRule from FS before working on this patch. The general design of this patch is based on FS's queue placement policy structure, with some changes. To your comments: bq. Is a common way of configuration proposed? No common configuration; the patch only defines a set of common interfaces. Since FS and CS are configured in very different ways, rules are currently created by each scheduler, see CapacityScheduler#updatePlacementRules as an example. bq. What steps are required for the Fair Scheduler to integrate with this? 1) Port the existing rules to the new APIs defined in the patch; this should be simple. 2) Change the configuration implementation to instantiate the newly defined PlacementRule; you may not need to change the existing configuration items themselves. 3) Change the FS workflow: with this patch, queue mapping happens before submission to the scheduler. Remove the queue mapping related logic from FS and create the queue if needed. bq. Each placement rule gets the chance to assign the app to a queue, reject the app, or pass. If it passes, the next rule gets a chance. The new APIs are very similar: non-null means determined, null means not determined, and an exception is thrown when the app is rejected. You can take a look at {{org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementRule}} bq. A placement rule can base its decision on: bq. Yes, you can do all of them with the new API except "The set of queues given in the Fair Scheduler configuration.": I was thinking about the necessity of passing the set of queues in the interface. In existing implementations of QueuePlacementPolicy, FS queues are only used to check the mapped queue's existence. I would prefer to delay that check until submission to the scheduler. 
See my next comment about the "create" flag for more details. Another reason for not passing the set of queue names via the interface is that queues are very dynamic. For example, if a user wants to submit an application to the queue with the lowest utilization, the set of queue names may not be enough. I would prefer to let the rule fetch what it needs from the scheduler. bq. Rules are marked as "terminal" if they will never pass. This helps to avoid misconfigurations where users place rules after terminal rules. I'm not sure it is necessary. I think whether a rule is terminal should be determined at runtime, but I'm OK with it if you think it's a must-have. bq. Rules have a "create" attribute which determines whether they can create a new queue or whether they must assign to existing queues. I think whether a queue can be created should be determined by the scheduler; it should be part of the scheduler configuration instead of the rule itself. You can add "create" to the rules you implement without any issue, but I prefer not to expose it in the public interface. bq. Currently the set of placement rules is limited to what's implemented in YARN. I.e. there's no public pluggable rule support. Agreed, this is one thing we need to do in the future. For now, we can make queue mapping happen in a central place first. bq. Are there places where my summary would not describe what's going on in this patch? I think it covers most of my patch; you can also take a look at the patch to see if there is anything unexpected :). 
> Get-queue-mapping should be a common interface of YarnScheduler > --- > > Key: YARN-3635 > URL: https://issues.apache.org/jira/browse/YARN-3635 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Reporter: Wangda Tan >Assignee: Tan, Wangda > Attachments: YARN-3635.1.patch, YARN-3635.2.patch, YARN-3635.3.patch, > YARN-3635.4.patch, YARN-3635.5.patch, YARN-3635.6.patch > > > Currently, both of fair/capacity scheduler support queue mapping, which makes > scheduler can change queue of an application after submitted to scheduler. > One issue of doing this in specific scheduler is: If the queue after mapping > has different maximum_allocation/default-node-label-expression of the > original queue, {{validateAndCreateResourceRequest}} in RMAppManager checks > the wrong queue. > I propose to make the queue mapping as a common interface of scheduler, and > RMAppManager set the queue after mapping before doing validations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
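The rule contract described in the comment above (non-null means determined, null means pass to the next rule, an exception means rejected) can be sketched roughly as follows. This is a minimal illustration only: {{QueueRule}}, {{RejectedException}}, {{RuleChain}} and the sample rules are hypothetical names invented here, not the interface from the patch, and the real {{PlacementRule}} signature may well differ.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the rule contract; NOT the PlacementRule from the patch. */
interface QueueRule {
  /** Returns a queue name when determined, null to pass, or throws to reject. */
  String getQueueForApp(String user, String requestedQueue);
}

/** Hypothetical unchecked exception used to signal that an app is rejected. */
class RejectedException extends RuntimeException {
  RejectedException(String message) {
    super(message);
  }
}

/** Applies the rules in order; the first non-null answer wins. */
class RuleChain {
  private final List<QueueRule> rules;

  RuleChain(List<QueueRule> rules) {
    this.rules = rules;
  }

  String place(String user, String requestedQueue) {
    for (QueueRule rule : rules) {
      String queue = rule.getQueueForApp(user, requestedQueue);
      if (queue != null) {
        return queue; // determined, later rules are not consulted
      }
    }
    // Assumption: when no rule decides, keep the queue the user asked for.
    return requestedQueue;
  }
}

class PlacementDemo {
  static RuleChain sampleChain() {
    // Rejects one made-up user, passes for everyone else.
    QueueRule rejectBanned = (user, queue) -> {
      if ("banned".equals(user)) {
        throw new RejectedException("user not allowed: " + user);
      }
      return null;
    };
    // Maps one made-up user to a fixed queue, passes otherwise.
    QueueRule userToQueue = (user, queue) -> "alice".equals(user) ? "analytics" : null;
    return new RuleChain(Arrays.asList(rejectBanned, userToQueue));
  }

  public static void main(String[] args) {
    RuleChain chain = sampleChain();
    System.out.println(chain.place("alice", "default"));
    System.out.println(chain.place("bob", "default"));
  }
}
```

Whether the mapped queue exists, or may be created, is deliberately left out of the rule here, in line with the preference stated above to leave that check to the scheduler.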
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630151#comment-14630151 ] Hadoop QA commented on YARN-433:
\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12740222/YARN-433.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8559/console |
This message was automatically generated.
> When RM is catching up with node updates then it should not expire acquired > containers > -- > > Key: YARN-433 > URL: https://issues.apache.org/jira/browse/YARN-433 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Xuan Gong > Attachments: YARN-433.1.patch, YARN-433.2.patch > > > RM expires containers that are not launched within some time of being > allocated. The default is 10mins. When an RM is not keeping up with node > updates then it may not be aware of new launched containers. If the expire > thread fires for such containers then the RM can expire them even though they > may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630167#comment-14630167 ] Wangda Tan commented on YARN-3784: -- Beyond the timeout, another thing we may need to consider is: after a container is removed from the to-be-preempted list, should we notify the scheduler/AM about that? This could happen if other applications release containers, or other queues/applications cancel resource requests. Currently proportionalCPP can notify the scheduler many times for the same container; if we have to-preempt/remove-from-to-preempt events, we can also reduce the number of messages sent to the scheduler (which could cause YARN-3508). > Indicate preemption timout along with the list of containers to AM > (preemption message) > --- > > Key: YARN-3784 > URL: https://issues.apache.org/jira/browse/YARN-3784 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch > > > Currently during preemption, AM is notified with a list of containers which > are marked for preemption. Introducing a timeout duration also along with > this container list so that AM can know how much time it will get to do a > graceful shutdown to its containers (assuming one of preemption policy is > loaded in AM). > This will help in decommissioning NM scenarios, where NM will be > decommissioned after a timeout (also killing containers on it). This timeout > will be helpful to indicate AM that those containers can be killed by RM > forcefully after the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
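The to-preempt/remove-from-to-preempt idea in the comment above could look something like the sketch below: remember which containers are already marked, and emit an event only when a container enters or leaves that set. {{PreemptionNotifier}} and the event strings are hypothetical, made up for illustration; nothing here is taken from ProportionalCapacityPreemptionPolicy or from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that turns per-round selections into change-only events. */
class PreemptionNotifier {
  // Containers the scheduler has already been told about.
  private final Set<String> marked = new TreeSet<>();

  /**
   * Takes the containers selected in the current round and returns only the
   * events that describe a change since the previous round.
   */
  List<String> update(Set<String> selected) {
    List<String> events = new ArrayList<>();

    // Newly selected containers: notify once, not on every round.
    for (String id : new TreeSet<>(selected)) {
      if (marked.add(id)) {
        events.add("TO_PREEMPT:" + id);
      }
    }

    // Containers that were marked earlier but are no longer selected.
    Set<String> dropped = new TreeSet<>(marked);
    dropped.removeAll(selected);
    for (String id : dropped) {
      marked.remove(id);
      events.add("REMOVE_FROM_TO_PREEMPT:" + id);
    }
    return events;
  }
}
```

With this shape, a container that stays selected across many rounds produces a single message, and a container that drops off the list (for example because another application released resources) produces one explicit removal event.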
[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630209#comment-14630209 ] Anubhav Dhoot commented on YARN-3900: - This is needed for YARN-3736. Without this the leveldb state store implementation of YARN-3736 actually causes a dump > Protobuf layout of yarn_security_token causes errors in other protos that > include it > - > > Key: YARN-3900 > URL: https://issues.apache.org/jira/browse/YARN-3900 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3900.001.patch, YARN-3900.001.patch > > > Because of the subdirectory server used in > {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} > there are errors in other protos that include them. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630251#comment-14630251 ] Subru Krishnan commented on YARN-3656: -- Thanks [~asuresh] for reviewing the patch. We did consider allowing declarative plugging of planners during the early stages of development but decided against it to keep the code base simpler to make it easier to grok as the current algorithms themselves are non-trivial. We are open to doing this in the future as & when the need arises. > LowCost: A Cost-Based Placement Agent for YARN Reservations > --- > > Key: YARN-3656 > URL: https://issues.apache.org/jira/browse/YARN-3656 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Ishai Menache >Assignee: Jonathan Yaniv > Labels: capacity-scheduler, resourcemanager > Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, > YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf > > > YARN-1051 enables SLA support by allowing users to reserve cluster capacity > ahead of time. YARN-1710 introduced a greedy agent for placing user > reservations. The greedy agent makes fast placement decisions but at the cost > of ignoring the cluster committed resources, which might result in blocking > the cluster resources for certain periods of time, and in turn rejecting some > arriving jobs. > We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” > the demand of the job throughout the allowed time-window according to a > global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.3.patch rebase the patch > When RM is catching up with node updates then it should not expire acquired > containers > -- > > Key: YARN-433 > URL: https://issues.apache.org/jira/browse/YARN-433 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Xuan Gong > Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch > > > RM expires containers that are not launched within some time of being > allocated. The default is 10mins. When an RM is not keeping up with node > updates then it may not be aware of new launched containers. If the expire > thread fires for such containers then the RM can expire them even though they > may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3868) ContainerManager recovery for container resizing
[ https://issues.apache.org/jira/browse/YARN-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-3868: Attachment: YARN-3868-YARN-1197.3.patch Update patch as dependent patches have been updated. > ContainerManager recovery for container resizing > > > Key: YARN-3868 > URL: https://issues.apache.org/jira/browse/YARN-3868 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-3868-YARN-1197.3.patch, YARN-3868.1.patch, > YARN-3868.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630275#comment-14630275 ] Joep Rottinghuis commented on YARN-3908:
bq. In fact, I'm wondering if we should put info and events into a separate column family like what we did for configs/metrics?
In principle we should keep everything in the same column family (fewer store files) unless:
a) The items that we store require a different TTL, compression, etc. This is the case for metrics, where we need a separate TTL.
b) The columns are rather significant in size, and in many queries they'll be skipped (and specifically not used in push-down predicates, i.e. column value filters etc.). This is the case for configuration. If we have many queries that just retrieve info fields and skip configs, then iterating over just the rows in the info column family has the benefit of not needing to access the config store files.
Otherwise a separate column family just results in more store files and doesn't really gain us anything.
Given the current code setup, switching column family is almost trivial, so given that there are no functionality differences, I'd say let's not even try to further optimize this until we have way more code in place. Then we can run large batches of historical job history files and other inputs (perhaps porting data from ATS v1) and see the potential benefit or downside.
The other reason not to do premature optimization is that I'm still thinking of adding a few more perf tweaks. Those would also be just performance optimizations with no functional difference, so they are not a priority now either. We should look at tuning all those things much later and together in a coherent way. Additional settings that we need to test are RPC compression, and encoding and/or compression of the store files.
In short, let's focus on completing functionality and then tinker with these settings later.
> Bugs in HBaseTimelineWriterImpl > --- > > Key: YARN-3908 > URL: https://issues.apache.org/jira/browse/YARN-3908 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Vrushali C > Attachments: YARN-3908-YARN-2928.001.patch, > YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch > > > 1. In HBaseTimelineWriterImpl, the info column family contains the basic > fields of a timeline entity plus events. However, entity#info map is not > stored at all. > 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
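The two criteria from the comment above (a group of columns needs its own storage settings, or it is large and skipped by many queries) can be written down as a small decision rule. This is plain Java with hypothetical names ({{ColumnGroup}}, {{FamilyPlanner}}); it does not use the HBase API and is not code from the timeline service.

```java
/** Hypothetical description of a group of columns; not part of the timeline service. */
class ColumnGroup {
  final String name;
  final boolean needsOwnStorageSettings; // e.g. a different TTL or compression
  final boolean largeAndOftenSkipped;    // big values that many queries never read

  ColumnGroup(String name, boolean needsOwnStorageSettings, boolean largeAndOftenSkipped) {
    this.name = name;
    this.needsOwnStorageSettings = needsOwnStorageSettings;
    this.largeAndOftenSkipped = largeAndOftenSkipped;
  }
}

class FamilyPlanner {
  /**
   * Returns the column family to use for a group: its own family only when
   * one of the two criteria holds, otherwise the shared default family.
   */
  static String familyFor(ColumnGroup group, String defaultFamily) {
    boolean separate = group.needsOwnStorageSettings || group.largeAndOftenSkipped;
    return separate ? group.name : defaultFamily;
  }
}
```

Under these assumptions, metrics (separate TTL) and configs (large, often skipped) get their own families, while info and events stay in the shared one until measurements suggest otherwise.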
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630283#comment-14630283 ] Wangda Tan commented on YARN-3932: -- [~bibinchundatt], I think we can add a method such as getTotalUsed in ResourceUsage class, which will be more efficient than iterating all liveContainers. This can be done in the near term. To make it correct, I think we need to return usage-by-partition object to application, which requires to change APIs. > SchedulerApplicationAttempt#getResourceUsageReport should be based on > NodeLabel > --- > > Key: YARN-3932 > URL: https://issues.apache.org/jira/browse/YARN-3932 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: ApplicationReport.jpg > > > Application Resource Report shown wrong when node Label is used. > 1.Submit application with NodeLabel > 2.Check RM UI for resources used > Allocated CPU VCores and Allocated Memory MB is always {{zero}} > {code} > public synchronized ApplicationResourceUsageReport getResourceUsageReport() { > AggregateAppResourceUsage runningResourceUsage = > getRunningAggregateAppResourceUsage(); > Resource usedResourceClone = > Resources.clone(attemptResourceUsage.getUsed()); > Resource reservedResourceClone = > Resources.clone(attemptResourceUsage.getReserved()); > return ApplicationResourceUsageReport.newInstance(liveContainers.size(), > reservedContainers.size(), usedResourceClone, reservedResourceClone, > Resources.add(usedResourceClone, reservedResourceClone), > runningResourceUsage.getMemorySeconds(), > runningResourceUsage.getVcoreSeconds()); > } > {code} > should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
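A rough sketch of the {{getTotalUsed}} idea from the comment above: keep usage per node label and sum across labels, instead of reading only the default (no-label) entry. {{LabelUsage}} is a hypothetical simplification written for this illustration; the real {{ResourceUsage}} class works with {{Resource}} objects and has its own locking, neither of which is modelled here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified per-label usage tracker; not YARN's ResourceUsage. */
class LabelUsage {
  static final String NO_LABEL = "";

  // label -> {memoryMb, vcores}
  private final Map<String, long[]> usedByLabel = new HashMap<>();

  void incUsed(String label, long memoryMb, long vcores) {
    long[] used = usedByLabel.computeIfAbsent(label, key -> new long[2]);
    used[0] += memoryMb;
    used[1] += vcores;
  }

  /** Usage under the default (empty) label only. */
  long[] getUsed() {
    return getUsed(NO_LABEL);
  }

  /** Usage under one specific label; zeros when nothing is recorded. */
  long[] getUsed(String label) {
    long[] used = usedByLabel.get(label);
    return used == null ? new long[2] : used.clone();
  }

  /** Usage summed over every label, without walking live containers. */
  long[] getTotalUsed() {
    long[] total = new long[2];
    for (long[] used : usedByLabel.values()) {
      total[0] += used[0];
      total[1] += used[1];
    }
    return total;
  }
}
```

In this model an application that runs entirely under a label reports zero from the no-argument {{getUsed()}}, which mirrors the symptom in the issue description, while {{getTotalUsed()}} returns the real figure.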
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630289#comment-14630289 ] Joep Rottinghuis commented on YARN-3908: Patch looks good with one comment. I completely overlooked the event info map, because it isn't part of the javadoc on the EntityTable. I should have double-checked but didn't. Thanks for catching this. [~sjlee0] I think it would be good to update the javadoc that describes the EntityTable in the EntityTable.java file. The same is probably missing from the doc "Timeline service schema for native HBase tables" (not sure which jira the PDF for that doc is attached to), because I copied it from the code. I don't think that the application table has been copied yet, so it won't be missing from there. > Bugs in HBaseTimelineWriterImpl > --- > > Key: YARN-3908 > URL: https://issues.apache.org/jira/browse/YARN-3908 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Vrushali C > Attachments: YARN-3908-YARN-2928.001.patch, > YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch > > > 1. In HBaseTimelineWriterImpl, the info column family contains the basic > fields of a timeline entity plus events. However, entity#info map is not > stored at all. > 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3908: -- Attachment: YARN-3908-YARN-2928.004.patch v.4 patch posted Thanks for your feedback [~jrottinghuis]! I corrected the {{EventTable}} javadoc to add the info key/value and the event timestamp. I also changed {{ColumnHelper.readTimeseriesResults()}} to use a different generic type (V) not to be confused with the main class type (T). > Bugs in HBaseTimelineWriterImpl > --- > > Key: YARN-3908 > URL: https://issues.apache.org/jira/browse/YARN-3908 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Vrushali C > Attachments: YARN-3908-YARN-2928.001.patch, > YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, > YARN-3908-YARN-2928.004.patch > > > 1. In HBaseTimelineWriterImpl, the info column family contains the basic > fields of a timeline entity plus events. However, entity#info map is not > stored at all. > 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
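The generic-type rename mentioned above is a readability point: a method-level type parameter that reuses the class-level name would shadow it. A small sketch, where {{ColumnReader}} and {{readTimeseries}} are hypothetical stand-ins and not the actual {{ColumnHelper}} signature:

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical class with a class-level type parameter T. */
class ColumnReader<T> {
  private final T table;

  ColumnReader(T table) {
    this.table = table;
  }

  T getTable() {
    return table;
  }

  /**
   * The method declares its own type parameter. Naming it V rather than T
   * keeps it visibly distinct from the class-level T, which a method-level
   * T would silently shadow.
   */
  <V> Map<Long, V> readTimeseries(Map<Long, Object> raw, Class<V> valueType) {
    Map<Long, V> result = new TreeMap<>();
    for (Map.Entry<Long, Object> entry : raw.entrySet()) {
      // cast() throws ClassCastException if a value has the wrong type
      result.put(entry.getKey(), valueType.cast(entry.getValue()));
    }
    return result;
  }
}
```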
[jira] [Updated] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3908: -- Attachment: YARN-3908-YARN-2928.004.patch Oops. Forgot ColumnPrefix. > Bugs in HBaseTimelineWriterImpl > --- > > Key: YARN-3908 > URL: https://issues.apache.org/jira/browse/YARN-3908 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Vrushali C > Attachments: YARN-3908-YARN-2928.001.patch, > YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, > YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch > > > 1. In HBaseTimelineWriterImpl, the info column family contains the basic > fields of a timeline entity plus events. However, entity#info map is not > stored at all. > 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-3905: - Attachment: YARN-3905.001.patch > Application History Server UI NPEs when accessing apps run after RM restart > --- > > Key: YARN-3905 > URL: https://issues.apache.org/jira/browse/YARN-3905 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3905.001.patch > > > From the Application History URL (http://RmHostName:8188/applicationhistory), > clicking on the application ID of an app that was run after the RM daemon has > been restarted results in a 500 error: > {noformat} > Sorry, got error 500 > Please consult RFC 2616 for meanings of the error code. > {noformat} > The stack trace is as follows: > {code} > 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO > applicationhistoryservice.FileSystemApplicationHistoryStore: Completed > reading history information of all application attempts of application > application_1436472584878_0001 > 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: > Failed to read the AM container of the application attempt > appattempt_1436472584878_0001_01. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
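The stack trace above points at {{convertToContainerReport}} dereferencing something that is null. A defensive conversion could look roughly like the sketch below. This is an assumption about the failure mode (a missing container history record), not the content of YARN-3905.001.patch, and {{ContainerHistory}} and {{ContainerReportSketch}} are hypothetical simplifications.

```java
/** Hypothetical, simplified stand-in for stored container history data. */
class ContainerHistory {
  final String containerId;

  ContainerHistory(String containerId) {
    this.containerId = containerId;
  }
}

class ContainerReportSketch {
  /**
   * Converts history data to a report string. Assumption: the store can
   * return null for a container it has no record of, so the null case is
   * handled here instead of surfacing as a NullPointerException.
   */
  static String convertToContainerReport(ContainerHistory history) {
    if (history == null) {
      return null; // the caller decides how to render a missing container
    }
    return "report for " + history.containerId;
  }
}
```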
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-3905: -- Target Version/s: 2.7.1, (was: 2.7.1) > Application History Server UI NPEs when accessing apps run after RM restart > --- > > Key: YARN-3905 > URL: https://issues.apache.org/jira/browse/YARN-3905 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3905.001.patch > > > From the Application History URL (http://RmHostName:8188/applicationhistory), > clicking on the application ID of an app that was run after the RM daemon has > been restarted results in a 500 error: > {noformat} > Sorry, got error 500 > Please consult RFC 2616 for meanings of the error code. > {noformat} > The stack trace is as follows: > {code} > 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO > applicationhistoryservice.FileSystemApplicationHistoryStore: Completed > reading history information of all application attempts of application > application_1436472584878_0001 > 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: > Failed to read the AM container of the application attempt > appattempt_1436472584878_0001_01. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630452#comment-14630452 ] Jonathan Eagles commented on YARN-3905: --- +1. [~eepayne], retargeting for 2.7.2 since 2.7.1 is already released. > Application History Server UI NPEs when accessing apps run after RM restart > --- > > Key: YARN-3905 > URL: https://issues.apache.org/jira/browse/YARN-3905 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3905.001.patch > > > From the Application History URL (http://RmHostName:8188/applicationhistory), > clicking on the application ID of an app that was run after the RM daemon has > been restarted results in a 500 error: > {noformat} > Sorry, got error 500 > Please consult RFC 2616 for meanings of the error code. > {noformat} > The stack trace is as follows: > {code} > 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO > applicationhistoryservice.FileSystemApplicationHistoryStore: Completed > reading history information of all application attempts of application > application_1436472584878_0001 > 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: > Failed to read the AM container of the application attempt > appattempt_1436472584878_0001_01. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-3905: -- Target Version/s: 2.7.2 (was: 2.7.1) > Application History Server UI NPEs when accessing apps run after RM restart > --- > > Key: YARN-3905 > URL: https://issues.apache.org/jira/browse/YARN-3905 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3905.001.patch > > > From the Application History URL (http://RmHostName:8188/applicationhistory), > clicking on the application ID of an app that was run after the RM daemon has > been restarted results in a 500 error: > {noformat} > Sorry, got error 500 > Please consult RFC 2616 for meanings of the error code. > {noformat} > The stack trace is as follows: > {code} > 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO > applicationhistoryservice.FileSystemApplicationHistoryStore: Completed > reading history information of all application attempts of application > application_1436472584878_0001 > 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: > Failed to read the AM container of the application attempt > appattempt_1436472584878_0001_01. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630472#comment-14630472 ] Hadoop QA commented on YARN-3905: -
\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 17m 14s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 29s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 23s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 1 new checkstyle issues (total was 39, now 40). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 23s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 9s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. |
| | | 40m 39s | |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745708/YARN-3905.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0bda84f |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/diffcheckstylehadoop-yarn-server-common.txt |
| hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8562/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8562/console |
This message was automatically generated.
> Application History Server UI NPEs when accessing apps run after RM restart > --- > > Key: YARN-3905 > URL: https://issues.apache.org/jira/browse/YARN-3905 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: YARN-3905.001.patch > > > From the Application History URL (http://RmHostName:8188/applicationhistory), > clicking on the application ID of an app that was run after the RM daemon has > been restarted results in a 500 error: > {noformat} > Sorry, got error 500 > Please consult RFC 2616 for meanings of the error code. 
> {noformat} > The stack trace is as follows: > {code} > 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO > applicationhistoryservice.FileSystemApplicationHistoryStore: Completed > reading history information of all application attempts of application > application_1436472584878_0001 > 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: > Failed to read the AM container of the application attempt > appattempt_1436472584878_0001_01. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apach
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630473#comment-14630473 ] Hadoop QA commented on YARN-433: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 52s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 9m 9s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 9s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 24s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 27s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 30s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 55m 12s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 94m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745685/YARN-433.3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 0bda84f | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8560/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8560/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8560/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8560/console | This message was automatically generated. > When RM is catching up with node updates then it should not expire acquired > containers > -- > > Key: YARN-433 > URL: https://issues.apache.org/jira/browse/YARN-433 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Xuan Gong > Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch > > > RM expires containers that are not launched within some time of being > allocated. The default is 10mins. When an RM is not keeping up with node > updates then it may not be aware of new launched containers. If the expire > thread fires for such containers then the RM can expire them even though they > may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630479#comment-14630479 ] Hadoop QA commented on YARN-3908: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 5s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 23s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 23s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 22s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 21s | Tests passed in hadoop-yarn-server-timelineservice. 
| | | | 45m 2s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745704/YARN-3908-YARN-2928.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / eb1932d | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8563/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8563/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8563/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8563/console | This message was automatically generated. > Bugs in HBaseTimelineWriterImpl > --- > > Key: YARN-3908 > URL: https://issues.apache.org/jira/browse/YARN-3908 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Vrushali C > Attachments: YARN-3908-YARN-2928.001.patch, > YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, > YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch > > > 1. In HBaseTimelineWriterImpl, the info column family contains the basic > fields of a timeline entity plus events. However, entity#info map is not > stored at all. > 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630500#comment-14630500 ] Hudson commented on YARN-3930: -- FAILURE: Integrated in Hadoop-trunk-Commit #8176 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8176/]) YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown. (Dian Fu via wangda) (wangda: rev fa2b63ed162410ba05eadf211a1da068351b293a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/CHANGES.txt > FileSystemNodeLabelsStore should make sure edit log file closed when > exception is thrown > - > > Key: YARN-3930 > URL: https://issues.apache.org/jira/browse/YARN-3930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Dian Fu >Assignee: Dian Fu > Fix For: 2.8.0 > > Attachments: YARN-3930.001.patch > > > When I test the node label feature in my local environment, I encountered the > following exception: > {code} > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > {code} > The reason is that HDFS throws an exception when calling > {{ensureAppendEditlogFile}}, for whatever reason, and as a result the edit log > output stream isn't closed. The next time we call > {{ensureAppendEditlogFile}}, lease recovery fails because we are already > the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
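The failure mode described in YARN-3930 (an exception leaves the edit log output stream open) can be illustrated with a minimal sketch. This is not the committed patch: `EditLogCloseSketch`, `FakeEditLog` and `writeOrClose` are hypothetical names, and the stream is a plain in-memory stand-in rather than an HDFS stream. It only shows the general idea of closing the stream before the exception propagates.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustration only; none of these names come from the YARN-3930 patch. */
class EditLogCloseSketch {

    /** Hypothetical stand-in for an edit log output stream. */
    static class FakeEditLog implements Closeable {
        boolean closed = false;
        final boolean failOnWrite;

        FakeEditLog(boolean failOnWrite) {
            this.failOnWrite = failOnWrite;
        }

        void write(String record) throws IOException {
            if (failOnWrite) {
                throw new IOException("simulated write failure");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /**
     * Writes one record. If the write throws, the stream is closed first and
     * the exception is then rethrown, so the writer does not keep the file open.
     */
    static void writeOrClose(FakeEditLog log, String record) throws IOException {
        try {
            log.write(record);
        } catch (IOException e) {
            log.close();
            throw e;
        }
    }
}
```

A healthy stream stays open after `writeOrClose`; a failing one ends up closed, which is the property the issue title asks for.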
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630509#comment-14630509 ] Hudson commented on YARN-3885: -- FAILURE: Integrated in Hadoop-trunk-Commit #8177 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8177/]) YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level. (Ajith S via wangda) (wangda: rev 3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt > ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 > level > -- > > Key: YARN-3885 > URL: https://issues.apache.org/jira/browse/YARN-3885 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Blocker > Fix For: 2.8.0 > > Attachments: YARN-3885.02.patch, YARN-3885.03.patch, > YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, > YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch > > > When the preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, > this piece of code, which calculates {{untoucable}}, doesn't consider all the > children; it considers only immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
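The difference between "only immediate children" and "all the children" in the YARN-3885 description can be sketched on a toy queue tree. This is an assumption-laden illustration, not the code from `ProportionalCapacityPreemptionPolicy`: `QueueTreeSketch`, `Node`, and the integer `amount` attached to each queue are hypothetical and stand in for whatever per-queue quantity the real policy aggregates.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; a toy queue hierarchy, not the YARN scheduler classes. */
class QueueTreeSketch {

    /** Hypothetical queue node carrying some per-queue amount. */
    static class Node {
        final String name;
        final int amount;
        final List<Node> children = new ArrayList<>();

        Node(String name, int amount) {
            this.name = name;
            this.amount = amount;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Looks one level down only; deeper queues are silently ignored. */
    static int sumImmediateChildren(Node parent) {
        int total = 0;
        for (Node child : parent.children) {
            total += child.amount;
        }
        return total;
    }

    /** Recurses through the whole subtree, so every descendant contributes. */
    static int sumAllDescendants(Node parent) {
        int total = 0;
        for (Node child : parent.children) {
            total += child.amount + sumAllDescendants(child);
        }
        return total;
    }
}
```

With a hierarchy deeper than two levels, the two methods disagree as soon as a grandchild queue carries a non-zero amount, which matches the symptom in the issue title.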
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630532#comment-14630532 ] Masatake Iwasaki commented on YARN-2578: Yes, it is the same fix. I agree it should be fixed in a hadoop-common JIRA. Thanks. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg > Attachments: YARN-2578.002.patch, YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should then re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services and make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DNs fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
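The stack trace above shows a thread parked in an unbounded wait for an RPC reply. A minimal sketch of the alternative, a wait bounded by a timeout, follows. It uses only the JDK (`CompletableFuture`), not Hadoop's IPC client; `BoundedWaitSketch` and `awaitReply` are hypothetical names, and returning `null` on timeout is a simplification chosen for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustration only; plain JDK concurrency, not the Hadoop RPC client. */
class BoundedWaitSketch {

    /**
     * Waits for a reply for at most timeoutMs milliseconds.
     * Returns the reply, or null if no reply arrived in time or the call failed,
     * so the caller gets control back and can retry against another server.
     */
    static String awaitReply(CompletableFuture<String> reply, long timeoutMs) {
        try {
            return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // the peer never answered within the bound
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        } catch (ExecutionException e) {
            return null; // the call itself failed
        }
    }
}
```

With an unbounded `get()` the first case would block forever, which is the behaviour the issue describes for the heartbeat thread.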
[jira] [Updated] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3900: Description: Because of the subdirectory server used in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} there are errors in other protos that include them. As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html {noformat} Any subdirectories under src/main/proto/ are treated as package structure for protobuf definition imports.{noformat} was: Because of the subdirectory server used in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} there are errors in other protos that include them. > Protobuf layout of yarn_security_token causes errors in other protos that > include it > - > > Key: YARN-3900 > URL: https://issues.apache.org/jira/browse/YARN-3900 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3900.001.patch, YARN-3900.001.patch > > > Because of the subdirectory server used in > {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} > there are errors in other protos that include them. > As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html > {noformat} Any subdirectories under src/main/proto/ are treated as package > structure for protobuf definition imports.{noformat} > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630535#comment-14630535 ] Masatake Iwasaki commented on YARN-2578: bq. 2. Would you tell me why Client.getRpcTimeout returns 0 if ipc.client.ping is false? Just to make it clear that the timeout has no effect without setting {{ipc.client.ping}} to true. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg > Attachments: YARN-2578.002.patch, YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should then re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services and make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DNs fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3900: Attachment: YARN-3900.002.patch Updated patch for recent changes in ContainerTokenIdentifierProto > Protobuf layout of yarn_security_token causes errors in other protos that > include it > - > > Key: YARN-3900 > URL: https://issues.apache.org/jira/browse/YARN-3900 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3900.001.patch, YARN-3900.001.patch, > YARN-3900.002.patch > > > Because of the subdirectory server used in > {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} > there are errors in other protos that include them. > As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html > {noformat} Any subdirectories under src/main/proto/ are treated as package > structure for protobuf definition imports.{noformat} > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630593#comment-14630593 ] Sangjin Lee commented on YARN-3906: --- The bulk of the work is done, but I'd like to wait until YARN-3908 is committed and update the changes. > split the application table from the entity table > - > > Key: YARN-3906 > URL: https://issues.apache.org/jira/browse/YARN-3906 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > > Per discussions on YARN-3815, we need to split the application entities from > the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630615#comment-14630615 ] Dian Fu commented on YARN-3930: --- Thanks [~leftnoteasy] for review and commit. > FileSystemNodeLabelsStore should make sure edit log file closed when > exception is thrown > - > > Key: YARN-3930 > URL: https://issues.apache.org/jira/browse/YARN-3930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Dian Fu >Assignee: Dian Fu > Fix For: 2.8.0 > > Attachments: YARN-3930.001.patch > > > When I test the node label feature in my local environment, I encountered the > following exception: > {code} > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) > at java.lang.Thread.run(Thread.java:745) > {code} > The reason is that HDFS throws an exception when calling > {{ensureAppendEditlogFile}}, for whatever reason, and as a result the edit log > output stream isn't closed. The next time we call > {{ensureAppendEditlogFile}}, lease recovery fails because we are already > the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-3931: --- Attachment: YARN-3931.001.patch I attached the patch. It works well in my cluster... :) > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: kyungwan nam > Attachments: YARN-3931.001.patch > > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using the REST API without "app-node-label-expression" or > "am-container-node-label-expression" > * RM doesn’t allocate containers to the hosts associated with the large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630643#comment-14630643 ] Hadoop QA commented on YARN-3900: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 51s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 44s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 36s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 59s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 3m 11s | Tests passed in hadoop-yarn-server-applicationhistoryservice. | | {color:green}+1{color} | yarn tests | 51m 8s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 97m 24s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745719/YARN-3900.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3540d5f | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8564/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8564/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8564/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8564/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8564/console | This message was automatically generated. > Protobuf layout of yarn_security_token causes errors in other protos that > include it > - > > Key: YARN-3900 > URL: https://issues.apache.org/jira/browse/YARN-3900 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-3900.001.patch, YARN-3900.001.patch, > YARN-3900.002.patch > > > Because of the subdirectory server used in > {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} > there are errors in other protos that include them. > As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html > {noformat} Any subdirectories under src/main/proto/ are treated as package > structure for protobuf definition imports.{noformat} > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630659#comment-14630659 ] Xianyin Xin commented on YARN-3931: --- This reminds me of a problem I ran into earlier. Hi [~Naganarasimha], can we consider removing the "" node label expression from the code? It doesn't seem to make sense to set a node label to "". A node label expression should be either "some_label" or null. Just a rough thought; what do you think? > default-node-label-expression doesn’t apply when an application is submitted > by RM rest api > --- > > Key: YARN-3931 > URL: https://issues.apache.org/jira/browse/YARN-3931 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: hadoop-2.6.0 >Reporter: kyungwan nam >Assignee: kyungwan nam > Attachments: YARN-3931.001.patch > > > * > yarn.scheduler.capacity..default-node-label-expression=large_disk > * submit an application using the REST API without "app-node-label-expression" or > "am-container-node-label-expression" > * RM doesn’t allocate containers to the hosts associated with the large_disk node > label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630669#comment-14630669 ] Ajith S commented on YARN-3885: --- Thanks [~leftnoteasy], [~xinxianyin] and [~sunilg] :) > ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 > level > -- > > Key: YARN-3885 > URL: https://issues.apache.org/jira/browse/YARN-3885 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.8.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Blocker > Fix For: 2.8.0 > > Attachments: YARN-3885.02.patch, YARN-3885.03.patch, > YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, > YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch > > > When the preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, > this piece of code, which calculates {{untoucable}}, doesn't consider all the > children; it considers only immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3736) Persist the Plan information, ie. accepted reservations to the RMStateStore for failover
[ https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3736: Attachment: YARN-3736.001.patch Patch that adds implementation of ReservationSystem state to all the state stores. Actually persisting information is next > Persist the Plan information, ie. accepted reservations to the RMStateStore > for failover > > > Key: YARN-3736 > URL: https://issues.apache.org/jira/browse/YARN-3736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Anubhav Dhoot > Attachments: YARN-3736.001.patch > > > We need to persist the current state of the plan, i.e. the accepted > ReservationAllocations & corresponding RLESpareseResourceAllocations to the > RMStateStore so that we can recover them on RM failover. This involves making > all the reservation system data structures protobuf friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3897) "Too many links" in NM log dir
[ https://issues.apache.org/jira/browse/YARN-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-3897: -- Description: Users need to keep container logs for more than one day. On some nodes of our busy cluster, the number of subdirs of {yarn.nodemanager.log-dirs} may reach 32000, which is the default limit of the ext3 file system. As a result, we got errors when initiating containers: "Failed to create directory {yarn.nodemanager.log-dirs}/application_1435111082717_1341740 - Too many links" Log aggregation is not an option for us because of the heavy pressure on the namenode. With a cluster of 5K nodes and 20k log files per node, it's not acceptable to aggregate so many files to hdfs. Since ext3 is still widely used, we should do something to avoid such errors. was: Users need to keep container logs for more than one day. On some nodes of our busy cluster, the number of subdirs of {yarn.nodemanager.log-dirs} may reach 32000, which is the default limit of the ext3 file system. As a result, we got errors when initiating containers: "Failed to create directory {yarn.nodemanager.log-dirs}/logs/application_1435111082717_1341740 - Too many links" Log aggregation is not an option for us because of the heavy pressure on the namenode. With a cluster of 5K nodes and 20k log files per node, it's not acceptable to aggregate so many files to hdfs. Since ext3 is still widely used, we should do something to avoid such errors. > "Too many links" in NM log dir > -- > > Key: YARN-3897 > URL: https://issues.apache.org/jira/browse/YARN-3897 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Minor > > Users need to keep container logs for more than one day. On some nodes of our > busy cluster, the number of subdirs of {yarn.nodemanager.log-dirs} may reach > 32000, which is the default limit of the ext3 file system. 
As a result, we got > errors when initiating containers: > "Failed to create directory > {yarn.nodemanager.log-dirs}/application_1435111082717_1341740 - Too many > links" > Log aggregation is not an option for us because of the heavy pressure on > the namenode. With a cluster of 5K nodes and 20k log files per node, it's not > acceptable to aggregate so many files to hdfs. > Since ext3 is still widely used, we should do something to avoid such errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
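One conceivable way to stay under a per-directory subdirectory limit is to spread the application directories over a fixed number of bucket subdirectories. The sketch below only illustrates that idea; it is not what the issue proposes and not what the NodeManager does. The class name, `BUCKETS`, `bucketedLogDir`, and the layout `<log-dir>/<bucket>/<appId>` are all assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of YARN: maps an application id to a
// bucketed log directory so that no single directory holds every app.
class LogDirBucketSketch {

    // Assumed bucket count; each bucket is one subdirectory of the log root.
    static final int BUCKETS = 256;

    /** Returns <logRoot>/<bucket>/<appId>, with the bucket derived from the id's hash. */
    static Path bucketedLogDir(Path logRoot, String appId) {
        // floorMod keeps the bucket in [0, BUCKETS) even for negative hash codes.
        int bucket = Math.floorMod(appId.hashCode(), BUCKETS);
        return logRoot.resolve(Integer.toString(bucket)).resolve(appId);
    }

    public static void main(String[] args) {
        // The root path here is a placeholder, not a real configuration value.
        Path dir = bucketedLogDir(Paths.get("nm-logs"), "application_1435111082717_1341740");
        System.out.println(dir);
    }
}
```

With such a layout the log root would hold at most `BUCKETS` subdirectories, and each bucket would hit the file system limit separately. Anything that reads the log directories would have to apply the same mapping.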
[jira] [Updated] (YARN-3736) Persist the Plan information, ie. accepted reservations to the RMStateStore for failover
[ https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3736: Attachment: YARN-3736.001.patch > Persist the Plan information, ie. accepted reservations to the RMStateStore > for failover > > > Key: YARN-3736 > URL: https://issues.apache.org/jira/browse/YARN-3736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Anubhav Dhoot > Attachments: YARN-3736.001.patch, YARN-3736.001.patch > > > We need to persist the current state of the plan, i.e. the accepted > ReservationAllocations & corresponding RLESpareseResourceAllocations to the > RMStateStore so that we can recover them on RM failover. This involves making > all the reservation system data structures protobuf friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630688#comment-14630688 ]

Hong Zhiguo commented on YARN-2768:
---

[~kasha], could you please review the patch?

> optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
>
> Key: YARN-2768
> URL: https://issues.apache.org/jira/browse/YARN-2768
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: fairscheduler
> Reporter: Hong Zhiguo
> Assignee: Hong Zhiguo
> Priority: Minor
> Attachments: YARN-2768.patch, profiling_FairScheduler_update.png
>
> See the attached picture of the profiling result. The clone of the Resource object
> within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of the
> function FairScheduler.update().
> The code of FSAppAttempt.updateDemand:
> {code}
> public void updateDemand() {
>   demand = Resources.createResource(0);
>   // Demand is current consumption plus outstanding requests
>   Resources.addTo(demand, app.getCurrentConsumption());
>   // Add up outstanding resource requests
>   synchronized (app) {
>     for (Priority p : app.getPriorities()) {
>       for (ResourceRequest r : app.getResourceRequests(p).values()) {
>         Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
>         Resources.addTo(demand, total);
>       }
>     }
>   }
> }
> {code}
> The code of Resources.multiply:
> {code}
> public static Resource multiply(Resource lhs, double by) {
>   return multiplyTo(clone(lhs), by);
> }
> {code}
> The clone could be skipped by directly updating the value of this.demand.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
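The idea in the YARN-2768 description, skipping the per-request clone and updating the demand directly, can be sketched as below. This is a minimal, self-contained illustration and not the attached patch: `SimpleResource` and `Request` are hypothetical stand-ins for YARN's `Resource` and `ResourceRequest`, the container count is applied as an `int` factor rather than a `double`, and the in-place accumulation is only one assumed way to avoid the temporary object.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for YARN's Resource and ResourceRequest; not the real classes.
class DemandSketch {

    /** Mutable resource vector (memory in MB, virtual cores). */
    static final class SimpleResource {
        long memory;
        int vcores;

        SimpleResource(long memory, int vcores) {
            this.memory = memory;
            this.vcores = vcores;
        }
    }

    /** A request for numContainers containers of the given capability. */
    static final class Request {
        final SimpleResource capability;
        final int numContainers;

        Request(SimpleResource capability, int numContainers) {
            this.capability = capability;
            this.numContainers = numContainers;
        }
    }

    /** Same shape as the quoted updateDemand: one temporary copy per request. */
    static SimpleResource demandWithClone(SimpleResource current, List<Request> requests) {
        SimpleResource demand = new SimpleResource(0, 0);
        demand.memory += current.memory;
        demand.vcores += current.vcores;
        for (Request r : requests) {
            // Copy the capability, then scale the copy (the clone-then-multiply pattern).
            SimpleResource total = new SimpleResource(r.capability.memory, r.capability.vcores);
            total.memory *= r.numContainers;
            total.vcores *= r.numContainers;
            demand.memory += total.memory;
            demand.vcores += total.vcores;
        }
        return demand;
    }

    /** Assumed alternative: scale and add straight into demand, with no temporaries. */
    static SimpleResource demandInPlace(SimpleResource current, List<Request> requests) {
        SimpleResource demand = new SimpleResource(current.memory, current.vcores);
        for (Request r : requests) {
            demand.memory += r.capability.memory * r.numContainers;
            demand.vcores += r.capability.vcores * r.numContainers;
        }
        return demand;
    }

    public static void main(String[] args) {
        SimpleResource current = new SimpleResource(1024, 1);
        List<Request> requests = Arrays.asList(
                new Request(new SimpleResource(512, 1), 3),
                new Request(new SimpleResource(2048, 2), 2));
        SimpleResource a = demandWithClone(current, requests);
        SimpleResource b = demandInPlace(current, requests);
        System.out.println(a.memory + " MB, " + a.vcores + " vcores");
        System.out.println(b.memory + " MB, " + b.vcores + " vcores");
    }
}
```

Both methods compute the same totals; the second simply allocates one object per call instead of one per request, which is the saving the issue is after.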
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630694#comment-14630694 ] Hong Zhiguo commented on YARN-2306: --- Hi [~rchiang], do you mean running the unit test in the patch against trunk? > leak of reservation metrics (fair scheduler) > > > Key: YARN-2306 > URL: https://issues.apache.org/jira/browse/YARN-2306 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Minor > Attachments: YARN-2306-2.patch, YARN-2306.patch > > > This only applies to fair scheduler. Capacity scheduler is OK. > When an appAttempt or node is removed, the metrics for > reservation (reservedContainers, reservedMB, reservedVCores) are not reduced > back. > These are important metrics for administrators. Wrong metrics may > confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3845) [YARN] YARN status in web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3845: --- Attachment: YARN-3845.patch > [YARN] YARN status in web ui does not show correctly in IE 11 > - > > Key: YARN-3845 > URL: https://issues.apache.org/jira/browse/YARN-3845 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jagadesh Kiran N >Assignee: Mohammad Shahid Khan >Priority: Trivial > Attachments: IE11_yarn.gif, YARN-3845.patch > > > In IE 11, the colors for the scheduler are not displayed properly. In other > browsers they are shown correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630733#comment-14630733 ] Ray Chiang commented on YARN-2306: -- Heh. That was two months ago. I believe I was referring to the unit test. > leak of reservation metrics (fair scheduler) > > > Key: YARN-2306 > URL: https://issues.apache.org/jira/browse/YARN-2306 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Minor > Attachments: YARN-2306-2.patch, YARN-2306.patch > > > This only applies to fair scheduler. Capacity scheduler is OK. > When an appAttempt or node is removed, the metrics for > reservation (reservedContainers, reservedMB, reservedVCores) are not reduced > back. > These are important metrics for administrators. Wrong metrics may > confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)