[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991374#comment-14991374 ]

nijel commented on YARN-2934:

Thanks [~Naganarasimha] for the patch. A few minor comments/doubts:

1.
{code}
FileStatus[] listStatus = fileSystem.listStatus(containerLogDir, new PathFilter() {
  @Override
  public boolean accept(Path path) {
    return FilenameUtils.wildcardMatch(path.getName(), errorFileNamePattern,
        IOCase.INSENSITIVE);
  }
});
{code}
What if this gives multiple error files?

2.
{code}
} catch (IOException e) {
  LOG.warn("Failed while trying to read container's error log", e);
}
{code}
Could this be logged at error level? I think there should not be any exception while reading the file, so if reading does fail it is better to log it as an error.

> Improve handling of container's stderr
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Gera Shegalov
> Assignee: Naganarasimha G R
> Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, YARN-2934.v1.003.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when container launch fails with {{ExitCodeException}} the message is empty.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
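A minimal sketch of one way question (1) above could be handled, reusing the {{fileSystem}}, {{containerLogDir}}, {{errorFileNamePattern}} and {{LOG}} names from the quoted snippet; this is illustrative only and not taken from the attached patches. If several files match the pattern, it falls back to the most recently modified one:

{code}
FileStatus[] matches = fileSystem.listStatus(containerLogDir, new PathFilter() {
  @Override
  public boolean accept(Path path) {
    return FilenameUtils.wildcardMatch(path.getName(), errorFileNamePattern,
        IOCase.INSENSITIVE);
  }
});
// Pick the error file written last when the wildcard matches more than one file.
FileStatus newest = null;
for (FileStatus candidate : matches) {
  if (newest == null
      || candidate.getModificationTime() > newest.getModificationTime()) {
    newest = candidate;
  }
}
if (matches.length > 1 && newest != null) {
  LOG.warn("Multiple files matched " + errorFileNamePattern + "; reading "
      + newest.getPath());
}
{code}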
[jira] [Commented] (YARN-4331) Restarting NodeManager leaves orphaned containers
[ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991729#comment-14991729 ]

Jason Lowe commented on YARN-4331:

SAMZA-750 is discussing RM restart, but this is NM restart. They are related but mostly independent features, and one can be enabled without the other. Check if yarn.nodemanager.recovery.enabled=true on that node. If you want to support rolling upgrades of the entire YARN cluster they both need to be enabled, but if you simply want to restart/upgrade a NodeManager independent of the ResourceManager then you can turn on nodemanager restart without resourcemanager restart. NodeManager restart should be mostly invisible to applications except for interruptions in the auxiliary services on that node (e.g.: shuffle handler).

bq. if the application master (AM) is dead, shouldn't it be the responsibility of the container to kill itself?

That is completely application framework dependent and not the responsibility of YARN. A container is completely under the control of the application (i.e.: user code) and doesn't have to have any YARN code in it at all. Theoretically one could write an application entirely in C or Go or whatever that generates compatible protocol buffers and adheres to the YARN RPC protocol semantics. No YARN code would be running at all for that application or in any of its containers at that point. (I know of no such applications, but it is theoretically possible.)

Also it is not a requirement that containers have an umbilical connection to the ApplicationMaster. That choice is up to the application, and some applications don't do this (like the distributed shell sample YARN application). MapReduce is an application framework that does have an umbilical connection, but if there's a bug in that app where tasks don't properly recognize the umbilical was severed then that's a bug in the app and not a bug in YARN.

Once the nodemanager died on the node, YARN lost all ability to control containers on that node. If the container decides not to exit then that's an issue with the app more than an issue with YARN. There's not much YARN can do about it since YARN's actor on that node is no longer present.

If NM restart is not enabled then the nodemanager should _not_ be killed with SIGKILL. Simply kill it with SIGTERM and the nodemanager should attempt to kill all containers before shutting down. Killing the NM with SIGKILL is normally only done when performing a work-preserving restart on the NM, and that requires that yarn.nodemanager.recovery.enabled=true on that node to function properly.

> Restarting NodeManager leaves orphaned containers
> -
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.7.1
> Reporter: Joseph
> Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza jobs.
> Steps:
> {quote}1. Deploy a job
> 2. Issue a kill -9 signal to nodemanager
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.
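For anyone trying this out, a minimal sketch of the configuration described above, using the documented NodeManager-restart properties; the directory and port values are arbitrary examples, not recommendations:

{code}
// Enable work-preserving NodeManager restart so a NM killed with SIGKILL can
// reattach to its running containers when it comes back up.
Configuration conf = new YarnConfiguration();
conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
conf.set("yarn.nodemanager.recovery.dir", "/var/hadoop/yarn-nm-recovery"); // example path
// NM restart also needs a fixed RPC port rather than an ephemeral one.
conf.set("yarn.nodemanager.address", "0.0.0.0:45454"); // example port
{code}

With only these set (and no RM work-preserving recovery), a single NodeManager can be restarted or upgraded independently of the ResourceManager, as described above.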
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS
[ https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991492#comment-14991492 ] Steve Loughran commented on YARN-3946: -- I'd like to see this in application reports, so that client-side applications can display the details > Allow fetching exact reason as to why a submitted app is in ACCEPTED state in > CS > > > Key: YARN-3946 > URL: https://issues.apache.org/jira/browse/YARN-3946 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Sumit Nigam >Assignee: Naganarasimha G R > Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic > message.png > > > Currently there is no direct way to get the exact reason as to why a > submitted app is still in ACCEPTED state. It should be possible to know > through RM REST API as to what aspect is not being met - say, queue limits > being reached, or core/ memory requirement not being met, or AM limit being > reached, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS
[ https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991501#comment-14991501 ]

Naganarasimha G R commented on YARN-3946:

Thanks [~steve_l]. I am reworking the patch based on [~wangda]'s comments; I will take this into account as well and upload a new patch.

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler, resourcemanager
> Affects Versions: 2.6.0
> Reporter: Sumit Nigam
> Assignee: Naganarasimha G R
> Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic message.png
>
> Currently there is no direct way to get the exact reason as to why a submitted app is still in ACCEPTED state. It should be possible to know through RM REST API as to what aspect is not being met - say, queue limits being reached, or core/memory requirement not being met, or AM limit being reached, etc.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4331) Restarting NodeManager leaves orphaned containers
[ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991575#comment-14991575 ]

Joseph commented on YARN-4331:

[~jlowe] Thanks for your comments, very helpful. yarn.resourcemanager.work-preserving-recovery.enabled is indeed set to false. The reason we have set it to false is that we run samza jobs on the yarn cluster and they don't work well with this feature turned on (https://issues.apache.org/jira/browse/SAMZA-750).

Apologies for my ignorance in this area, but if the application master (AM) is dead, shouldn't it be the responsibility of the container to kill itself? I'd imagine every container should be required to heartbeat to its application master and kill itself if it misses a few?

> Restarting NodeManager leaves orphaned containers
> -
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.7.1
> Reporter: Joseph
> Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza jobs.
> Steps:
> {quote}1. Deploy a job
> 2. Issue a kill -9 signal to nodemanager
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4312) TestSubmitApplicationWithRMHA fails on branch-2.7 and branch-2.6 as some of the test cases time out
[ https://issues.apache.org/jira/browse/YARN-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-4312: -- Fix Version/s: 2.6.3 Cherry-picked the fix to branch-2.6. Thanks [~varun_saxena]! > TestSubmitApplicationWithRMHA fails on branch-2.7 and branch-2.6 as some of > the test cases time out > > > Key: YARN-4312 > URL: https://issues.apache.org/jira/browse/YARN-4312 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1, 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 2.7.2, 2.6.3 > > Attachments: YARN-4312-branch-2.6.01.patch, > YARN-4312-branch-2.7.01.patch > > > These timeouts happen because we do ZK sync operation on RM startup after > YARN-3798 which delays RM startup a bit making the timeouts of 5 s. too small > for a couple of tests in TestSubmitApplicationWithRMHA. > {noformat} > testHandleRMHADuringSubmitApplicationCallWithSavedApplicationState(org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA) > Time elapsed: 5.162 sec <<< ERROR! > java.lang.Exception: test timed out after 5000 milliseconds > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.syncInternal(ZKRMStateStore.java:944) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.startInternal(ZKRMStateStore.java:320) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStart(RMStateStore.java:562) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:559) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:303) > at > org.apache.hadoop.yarn.server.resourcemanager.RMHATestBase.startRMs(RMHATestBase.java:191) > at > org.apache.hadoop.yarn.server.resourcemanager.RMHATestBase.startRMs(RMHATestBase.java:111) > at > org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA.testHandleRMHADuringSubmitApplicationCallWithSavedApplicationState(TestSubmitApplicationWithRMHA.java:234) > > testHandleRMHADuringSubmitApplicationCallWithoutSavedApplicationState(org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA) > Time elapsed: 5.146 sec <<< ERROR! 
> java.lang.Exception: test timed out after 5000 milliseconds > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.syncInternal(ZKRMStateStore.java:944) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.startInternal(ZKRMStateStore.java:320) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStart(RMStateStore.java:562) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at >
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992234#comment-14992234 ]

Brook Zhou commented on YARN-3223:

The unit tests that failed were not affected by the patch. May be related to [YARN-2634|https://issues.apache.org/jira/browse/YARN-2634].

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, resourcemanager
> Affects Versions: 2.7.1
> Reporter: Junping Du
> Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, including: make RMNode keep track of the old resource for possible rollback, keep the available resource at 0, and update the used resource when containers finish.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2882) Introducing container types
[ https://issues.apache.org/jira/browse/YARN-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992207#comment-14992207 ] Konstantinos Karanasos commented on YARN-2882: -- Thanks for the feedback, [~asuresh]. I will address point (2), and will fix the patch so that it applies to the yarn-2877 branch I created off apache trunk yesterday. Moreover, I will make sure we align with YARN-3116 that introduced container types for distinguishing the AM container. Regarding the builder pattern, do you think we should address that here or is it better to create a separate JIRA for having a more principled way to create new instances of resources? > Introducing container types > --- > > Key: YARN-2882 > URL: https://issues.apache.org/jira/browse/YARN-2882 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: yarn-2882.patch > > > This JIRA introduces the notion of container types. > We propose two initial types of containers: guaranteed-start and queueable > containers. > Guaranteed-start are the existing containers, which are allocated by the > central RM and are instantaneously started, once allocated. > Queueable is a new type of container, which allows containers to be queued in > the NM, thus their execution may be arbitrarily delayed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992803#comment-14992803 ] Hadoop QA commented on YARN-3840: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} docker + precommit patch detected. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 31s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 29s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 14s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 18s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 13s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 3 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 57s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 18s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 17s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 10s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 56s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 54s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 18s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 47s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 53s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 23s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 54s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 39s {color} | {color:green} hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 1s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_79. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 7s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.7.0_79. {color} | |
[jira] [Commented] (YARN-4219) New levelDB cache storage for timeline v1.5
[ https://issues.apache.org/jira/browse/YARN-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992751#comment-14992751 ] Hadoop QA commented on YARN-4219: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} docker + precommit patch detected. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 16s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 42s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s {color} | {color:red} Patch generated 4 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice (total was 59, now 63). {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 20s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 43s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.7.0_79. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 17m 21s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-05 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12770890/YARN-4219-trunk.003.patch | | JIRA Issue | YARN-4219 | | Optional Tests | asflicense javac javadoc mvninstall unit xml compile findbugs checkstyle | | uname | Linux 2d0c4cb2ffed 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality |
[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chang Li updated YARN-4334:
Attachment: YARN-4334.wip.patch

Uploaded a prototype patch, which heartbeats to LeveldbRMStateStore; on RM recovery it checks whether the state store has expired.

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: Jason Lowe
> Assignee: Chang Li
> Attachments: YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that ApplicationMasters and potentially external client-side monitoring mechanisms have given up completely. If the ResourceManager starts back up and tries to recover we can get into situations where the RM launches new application attempts for the AMs that gave up, but then the client _also_ launches another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to recover if the state store was "too old." The RM would come up without any applications recovered, but we would avoid a double-submission situation.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4219) New levelDB cache storage for timeline v1.5
[ https://issues.apache.org/jira/browse/YARN-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-4219: Attachment: YARN-4219-trunk.003.patch Attaching 003 patch to fix javac warnings and javadoc errors. > New levelDB cache storage for timeline v1.5 > --- > > Key: YARN-4219 > URL: https://issues.apache.org/jira/browse/YARN-4219 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-4219-trunk.001.patch, YARN-4219-trunk.002.patch, > YARN-4219-trunk.003.patch > > > We need to have an "offline" caching storage for timeline server v1.5 after > the changes in YARN-3942. The in memory timeline storage may run into OOM > issues when used as a cache storage for entity file timeline storage. We can > refactor the code and have a level db based caching storage for this use > case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
Jason Lowe created YARN-4334: Summary: Ability to avoid ResourceManager recovery if state store is "too old" Key: YARN-4334 URL: https://issues.apache.org/jira/browse/YARN-4334 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jason Lowe There are times when a ResourceManager has been down long enough that ApplicationMasters and potentially external client-side monitoring mechanisms have given up completely. If the ResourceManager starts back up and tries to recover we can get into situations where the RM launches new application attempts for the AMs that gave up, but then the client _also_ launches another instance of the app because it assumed everything was dead. It would be nice if the RM could be optionally configured to avoid trying to recover if the state store was "too old." The RM would come up without any applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992405#comment-14992405 ] Jason Lowe commented on YARN-4334: -- This would probably involve some sort of "heartbeat" to the state store to keep track of an approximate last uptime of the ResourceManager. We would not want to update the state store very often, probably only on the order of a minute or so. One key use-case for this is Oozie. Oozie launchers have a known problem where when they restart they will re-launch applications. If the launcher AM gives up and the sub-job's AM gives up, then when the RM recovers and re-launches AM attempts for both jobs the launcher will re-submit the job. Then there will be two instances of the sub-job running which is undesirable. I suspect there are other job-launches-job situations besides Oozie where this would also be problematic. > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
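To make the idea above concrete, a rough sketch of the recovery-time check; every name here (the accessor, the property, the recovery entry point) is a hypothetical illustration rather than anything from the WIP patch:

{code}
// Hypothetical sketch: the store persists a "last alive" timestamp roughly once a
// minute while the RM runs; on startup, recovery is skipped if that timestamp is too old.
long lastAlive = stateStore.loadLastHeartbeatTime();           // hypothetical accessor
long maxAgeMs = conf.getLong(
    "yarn.resourcemanager.state-store.max-recovery-age-ms",    // hypothetical property
    0L);                                                        // 0 = always recover
if (maxAgeMs > 0 && System.currentTimeMillis() - lastAlive > maxAgeMs) {
  LOG.warn("State store last heartbeat is older than " + maxAgeMs
      + " ms; starting fresh without recovering applications");
} else {
  recoverApplications(stateStore);                              // hypothetical recovery entry point
}
{code}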
[jira] [Assigned] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"
[ https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li reassigned YARN-4334: -- Assignee: Chang Li > Ability to avoid ResourceManager recovery if state store is "too old" > - > > Key: YARN-4334 > URL: https://issues.apache.org/jira/browse/YARN-4334 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jason Lowe >Assignee: Chang Li > > There are times when a ResourceManager has been down long enough that > ApplicationMasters and potentially external client-side monitoring mechanisms > have given up completely. If the ResourceManager starts back up and tries to > recover we can get into situations where the RM launches new application > attempts for the AMs that gave up, but then the client _also_ launches > another instance of the app because it assumed everything was dead. > It would be nice if the RM could be optionally configured to avoid trying to > recover if the state store was "too old." The RM would come up without any > applications recovered, but we would avoid a double-submission situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992473#comment-14992473 ]

Varun Saxena commented on YARN-4330:

It's not retrying per se. It's just that we try to get memory and CPU info in multiple places. In some places (for monitoring) it tries to read the calculator plugin class from config, and in others it directly uses the default one (while trying to detect the NM's CPU/memory capability). As for the default resource calculator plugin, Mac is not supported, hence the UnsupportedOperationException.

While monitoring, if the resource calculator plugin class is not configured, trying to load the default calculator plugin (and hence this code path) is the default behavior. We can't really switch it off, but we need not print the whole stack trace for UnsupportedOperationException.

For MiniYARNCluster, we can do a few more things. When we try to load the default resource calculator plugin via the NodeManagerHardwareUtils class (to detect the NM's CPU/memory), we can switch off this behavior via a config. The code can be rearranged so that we check the config first and this error does not show up, and the config can be set to false in MiniYARNCluster. Also, a dummy plugin implementation could be included in MiniYARNCluster and set in the config so that it does not try to load the default resource calculator.

> MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
> ---
>
> Key: YARN-4330
> URL: https://issues.apache.org/jira/browse/YARN-4330
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test, yarn
> Affects Versions: 2.8.0
> Environment: OSX, JUnit
> Reporter: Steve Loughran
> Assignee: Varun Saxena
> Priority: Blocker
>
> Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I see multiple stack traces warning me that a resource calculator plugin could not be created
> {code}
> (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - java.lang.UnsupportedOperationException: Could not determine OS: Failed to instantiate default resource calculator.
> java.lang.UnsupportedOperationException: Could not determine OS
> {code}
> This is a minicluster. It doesn't need resource calculation. It certainly doesn't need test logs being cluttered with even more stack traces which will only generate false alarms about tests failing.
> There needs to be a way to turn this off, and the minicluster should have it that way by default.
> Being ruthless and marking as a blocker, because it's a fairly major regression for anyone testing with the minicluster.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
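A small sketch of the test-side workaround being proposed above; the property name below is hypothetical (the guard does not exist yet), and the idea is simply that MiniYARNCluster would set it to false so hardware detection is never attempted on unsupported platforms:

{code}
YarnConfiguration conf = new YarnConfiguration();
// Hypothetical switch: skip OS-based hardware detection (and thus the default
// resource calculator) and rely purely on configured memory/vcore values.
conf.setBoolean("yarn.nodemanager.resource.detect-hardware-capabilities", false);
MiniYARNCluster cluster = new MiniYARNCluster("test", 1, 1, 1);
cluster.init(conf);
cluster.start();
{code}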
[jira] [Updated] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
[ https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-2859: -- Fix Version/s: 2.6.3 > ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster > -- > > Key: YARN-2859 > URL: https://issues.apache.org/jira/browse/YARN-2859 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Hitesh Shah >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Fix For: 2.8.0, 2.7.2, 2.6.3 > > Attachments: YARN-2859.txt > > > In mini cluster, a random port should be used. > Also, the config is not updated to the host that the process got bound to. > {code} > 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster > (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer > address: localhost:10200 > 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster > (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer > web address: 0.0.0.0:8188 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4219) New levelDB cache storage for timeline v1.5
[ https://issues.apache.org/jira/browse/YARN-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992814#comment-14992814 ]

Xuan Gong commented on YARN-4219:

+1. The last patch looks good to me. Let us wait for several days; if there are no other comments, I will commit it then. [~jlowe] and [~jeagles], do you have any comments on this?

> New levelDB cache storage for timeline v1.5
> ---
>
> Key: YARN-4219
> URL: https://issues.apache.org/jira/browse/YARN-4219
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 2.8.0
> Reporter: Li Lu
> Assignee: Li Lu
> Attachments: YARN-4219-trunk.001.patch, YARN-4219-trunk.002.patch, YARN-4219-trunk.003.patch
>
>
> We need to have an "offline" caching storage for timeline server v1.5 after the changes in YARN-3942. The in memory timeline storage may run into OOM issues when used as a cache storage for entity file timeline storage. We can refactor the code and have a level db based caching storage for this use case.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4219) New levelDB cache storage for timeline v1.5
[ https://issues.apache.org/jira/browse/YARN-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-4219: Affects Version/s: 2.8.0 > New levelDB cache storage for timeline v1.5 > --- > > Key: YARN-4219 > URL: https://issues.apache.org/jira/browse/YARN-4219 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-4219-trunk.001.patch, YARN-4219-trunk.002.patch, > YARN-4219-trunk.003.patch > > > We need to have an "offline" caching storage for timeline server v1.5 after > the changes in YARN-3942. The in memory timeline storage may run into OOM > issues when used as a cache storage for entity file timeline storage. We can > refactor the code and have a level db based caching storage for this use > case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4335) Introducing resource request types
Konstantinos Karanasos created YARN-4335: Summary: Introducing resource request types Key: YARN-4335 URL: https://issues.apache.org/jira/browse/YARN-4335 Project: Hadoop YARN Issue Type: Sub-task Reporter: Konstantinos Karanasos Assignee: Konstantinos Karanasos YARN-2882 introduced container types that are internal (not user-facing) and are used by the ContainerManager during execution at the NM. With this JIRA we are introducing (user-facing) resource request types that are used by the AM to specify the type of the ResourceRequest. We will initially support two resource request types: CONSERVATIVE and OPTIMISTIC. CONSERVATIVE resource requests will be handed internally to containers of GUARANTEED type, whereas OPTIMISTIC resource requests will be handed to QUEUEABLE containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
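To illustrate the proposal, a tiny sketch of the mapping described above; neither the enum nor the GUARANTEED/QUEUEABLE container-type values exist yet, so treat every name here as part of the proposal rather than current API:

{code}
// Proposed user-facing request types and their mapping to the internal
// container types discussed in YARN-2882 / YARN-3116.
public enum ResourceRequestType {
  CONSERVATIVE,   // handed to GUARANTEED containers: allocated by the RM, started immediately
  OPTIMISTIC;     // handed to QUEUEABLE containers: may be queued at the NM, start can be delayed

  public ContainerType toContainerType() {
    return this == CONSERVATIVE ? ContainerType.GUARANTEED : ContainerType.QUEUEABLE;
  }
}
{code}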
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993149#comment-14993149 ]

Naganarasimha G R commented on YARN-2556:

Hi [~sjlee0] & [~lichangleo],
In a user query on a forum I came across an issue where each insert took longer as the existing LevelDB data grew (about 3 GB on disk), at a sustained rate of 10 inserts per second over a span of 15 minutes. During each insertion a query is made to pick the date for identifying {{CreationTime}}, which I presume is the reason the ATS inserts are slow. Maybe we can optimize the test to start with some initial LevelDB data and then measure the performance. This would also be useful to evaluate the performance of ATS v1.5. Thoughts?

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Jonathan Eagles
> Assignee: Chang Li
> Labels: BB2015-05-TBR
> Fix For: 2.8.0
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity.
> I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start.
> This could be done as an example or test job that could be tied into gridmix.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
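A rough sketch of the warm-up idea above using the public TimelineClient API; {{conf}} and {{preloadCount}} are assumed to exist in the surrounding test, and the entity type/id values are arbitrary examples:

{code}
// Pre-load the timeline store before the timed run so that write performance is
// measured against a non-empty LevelDB, as suggested in the comment above.
TimelineClient client = TimelineClient.createTimelineClient();
client.init(conf);
client.start();
for (int i = 0; i < preloadCount; i++) {
  TimelineEntity entity = new TimelineEntity();
  entity.setEntityType("WARMUP_ENTITY");
  entity.setEntityId("warmup_" + i);
  entity.setStartTime(System.currentTimeMillis());
  client.putEntities(entity);
}
// ...then run the existing measurement job against the warmed-up store.
client.stop();
{code}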
[jira] [Commented] (YARN-2882) Introducing container types
[ https://issues.apache.org/jira/browse/YARN-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992845#comment-14992845 ] Konstantinos Karanasos commented on YARN-2882: -- Following our offline discussion with [~asuresh], we are going to integrate this JIRA with YARN-3116. In particular, we will extend the ContainerType by substituting the TASK field with the following two: GUARANTEED and QUEUEABLE. > Introducing container types > --- > > Key: YARN-2882 > URL: https://issues.apache.org/jira/browse/YARN-2882 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: yarn-2882.patch > > > This JIRA introduces the notion of container types. > We propose two initial types of containers: guaranteed-start and queueable > containers. > Guaranteed-start are the existing containers, which are allocated by the > central RM and are instantaneously started, once allocated. > Queueable is a new type of container, which allows containers to be queued in > the NM, thus their execution may be arbitrarily delayed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)