[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600817#comment-14600817 ] Hadoop QA commented on YARN-2871: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 51s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 49s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 30s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741794/YARN-2871.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a815cc1 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8341/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8341/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8341/console | This message was automatically generated. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
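One generic way to make this kind of asynchronous verification less timing-sensitive (a sketch only, not necessarily what the attached YARN-2871 patches do) is Mockito's {{timeout()}} verification mode, which polls until the expected invocation count is reached instead of asserting it immediately. The class and method names below are placeholders wrapping the spy already used by TestRMRestart.
{code}
package org.apache.hadoop.yarn.server.resourcemanager;

import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.yarn.api.records.ApplicationId;

// Placed in the test's package so the spied RMAppManager method is visible.
class LogSummaryVerification {
  static void verifyThreeSummaries(RMAppManager rmAppManager) {
    // Wait up to 10 seconds for the third asynchronous invocation instead of
    // asserting the exact count immediately after submitting the apps.
    verify(rmAppManager, timeout(10000).times(3))
        .logApplicationSummary(isA(ApplicationId.class));
  }
}
{code}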
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601848#comment-14601848 ] Wangda Tan commented on YARN-3849: -- [~sunilg], Trying to understand this issue: when toObtainResource becomes 10,0 and the container sizes are c1=2,1, c2=5,3, c3=4,2, c4=2,1, the preemption policy will kill c1..c3. My understanding of this problem is that the preemption policy can preempt one resource type (CPU/memory) more than needed, but I'm not sure why it preempts all containers except the AM. Too much of preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA and consumes the full cluster capacity. 2. After an app is submitted in QueueB, there is some demand and preemption is invoked in QueueA. 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA. 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption kicks in for QueueB now. The scenario in steps 3 and 4 keeps happening in a loop, so neither app completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
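The overshoot in one dimension can be seen with plain arithmetic. The sketch below is not YARN code and its loop condition is a deliberate simplification of the real policy; it just replays the container sizes from the comment above against a toObtainResource of <10,0> and shows how preempting whole containers until every dimension is satisfied over-preempts vcores.
{code}
// Not YARN code: a self-contained replay of the numbers in the comment above.
public class DrfPreemptionOvershoot {
  public static void main(String[] args) {
    int[] toObtain = {10, 0};                       // <memory, vcores> still to obtain
    int[][] containers = {{2, 1}, {5, 3}, {4, 2}};  // c1, c2, c3

    for (int[] c : containers) {
      // Simplification: keep preempting while any dimension is still positive.
      if (toObtain[0] <= 0 && toObtain[1] <= 0) {
        break;
      }
      toObtain[0] -= c[0];
      toObtain[1] -= c[1];
      System.out.printf("preempted <%d,%d> -> remaining <%d,%d>%n",
          c[0], c[1], toObtain[0], toObtain[1]);
    }
    // Ends at <-1,-6>: the memory demand drove one more kill while six
    // vcores were preempted that were never needed.
  }
}
{code}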
[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3793: --- Attachment: YARN-3793.01.patch Several NPEs when deleting local files on NM recovery - Key: YARN-3793 URL: https://issues.apache.org/jira/browse/YARN-3793 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-3793.01.patch When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean incorrect tracking of these resources and a potential leak. This JIRA is to investigate and fix anything required. Logs show: {noformat} 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
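A minimal, hypothetical guard for the deletion path in the stack trace above: skipping a null sub-directory before handing it to {{FileContext#delete}} would avoid the NPE, assuming the null path itself is benign; the actual patch may instead address why a null path is recorded in the first place.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public final class SafeDelete {
  private SafeDelete() {}

  // Skip a null sub-directory (e.g. recovered state with no path recorded)
  // instead of passing it to FileContext.delete(), which triggers the NPE
  // shown in the stack trace.
  public static void deleteIfPresent(FileContext lfs, Path subDir)
      throws IOException {
    if (subDir == null) {
      return;
    }
    lfs.delete(subDir, true);
  }
}
{code}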
[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3793: --- Priority: Critical (was: Major) Several NPEs when deleting local files on NM recovery - Key: YARN-3793 URL: https://issues.apache.org/jira/browse/YARN-3793 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-3793.01.patch When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean incorrect tracking of these resources and a potential leak. This JIRA is to investigate and fix anything required. Logs show: {noformat} 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup
[ https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602341#comment-14602341 ] Jian He edited comment on YARN-3855 at 6/26/15 4:35 AM: I believe what you suggested is generally good practice for setting up a secure cluster. Btw, the patch did not enable/enforce any of this. People can configure whatever they want for the HTTP authentication regardless of how the rest of the components are set up, just as before this jira. The point of this jira is to prevent the scenario where a user cannot view any application (even their own) in any way unless the daemon is restarted. was (Author: jianhe): I believe what you suggested is generally good practice for setting up a secure cluster. Btw, the patch did not enable/enforce any of this. People can configure whatever they want for the HTTP authentication regardless of how the rest of the components are set up, just as before this jira. The point of this jira is to prevent the scenario where a user cannot view the applications in any way unless the daemon is restarted. If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup Key: YARN-3855 URL: https://issues.apache.org/jira/browse/YARN-3855 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3855.1.patch, YARN-3855.2.patch If all ACLs (admin acl, queue-admin-acls etc.) are set up properly and http.authentication.type is 'simple' in secure mode, the user cannot view the application web page in the default setup because the incoming user is always considered to be dr.who. The user also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. This is inconvenient from the user's perspective. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3826: The timed out test is not related to the patch. +1, will commit it shortly. Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
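A minimal sketch of the general idea behind the fix: build a fresh {{NodeHeartbeatResponse}} for every heartbeat instead of mutating shared static instances, so a concurrent {{setDiagnosticsMessage}} cannot clobber another node's message. The helper name below is illustrative; the committed change adds its own builder logic in {{YarnServerBuilderUtils}} and {{ResourceTrackerService}}.
{code}
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatResponse;
import org.apache.hadoop.yarn.server.api.records.NodeAction;
import org.apache.hadoop.yarn.util.Records;

public final class HeartbeatResponses {
  private HeartbeatResponses() {}

  // Build a brand-new response for each heartbeat instead of reusing a static
  // "resync"/"shutdown" instance whose diagnostics message can be overwritten
  // by a concurrent nodeHeartbeat() call.
  public static NodeHeartbeatResponse newResponse(NodeAction action,
      String diagnostics) {
    NodeHeartbeatResponse response =
        Records.newRecord(NodeHeartbeatResponse.class);
    response.setNodeAction(action);
    response.setDiagnosticsMessage(diagnostics);
    return response;
  }
}
{code}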
[jira] [Updated] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3745: Hadoop Flags: Reviewed +1, the latest patch looks good to me; will commit it shortly. SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException, it tries to create the internal exception in instantiateException() with cn = cls.getConstructor(String.class). If cls does not have a constructor with a String parameter, it throws NoSuchMethodException (for example, the ClosedChannelException class). We should also try to instantiate the exception with the default constructor so that the inner exception can be propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
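A sketch of the fallback the description asks for, assuming reflection-based instantiation similar to {{instantiateException()}}: prefer the {{(String)}} constructor and fall back to the no-arg constructor when it is absent (e.g. {{ClosedChannelException}}). This is illustrative, not the committed code.
{code}
import java.lang.reflect.Constructor;

public final class ExceptionInstantiator {
  private ExceptionInstantiator() {}

  // Prefer the (String) constructor; fall back to the no-arg constructor
  // (e.g. ClosedChannelException) so the inner exception still propagates.
  public static Throwable instantiate(Class<? extends Throwable> cls,
      String message, Throwable cause) throws Exception {
    Throwable t;
    try {
      Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
      cn.setAccessible(true);
      t = cn.newInstance(message);
    } catch (NoSuchMethodException e) {
      Constructor<? extends Throwable> cn = cls.getConstructor();
      cn.setAccessible(true);
      t = cn.newInstance();
    }
    if (cause != null) {
      t.initCause(cause);
    }
    return t;
  }
}
{code}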
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService: potential wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600981#comment-14600981 ] Hadoop QA commented on YARN-3826: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 16m 46s | Pre-patch trunk has 3 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 14s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 61m 0s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 101m 33s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741817/YARN-3826.03.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a815cc1 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8342/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8342/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8342/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8342/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8342/console | This message was automatically generated. Race condition in ResourceTrackerService: potential wrong diagnostics messages -- Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. 
On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601145#comment-14601145 ] Hudson commented on YARN-3745: -- FAILURE: Integrated in Hadoop-trunk-Commit #8066 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8066/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * hadoop-yarn-project/CHANGES.txt SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException it tries to create internal exception in instantiateException() with cn = cls.getConstructor(String.class). if cls does not has a constructor with String parameter it throws Nosuchmethodexception for example ClosedChannelException class. We should also try to instantiate exception with default constructor so that inner exception can to propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600995#comment-14600995 ] Hudson commented on YARN-3826: -- FAILURE: Integrated in Hadoop-trunk-Commit #8065 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8065/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
[ https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601042#comment-14601042 ] Hadoop QA commented on YARN-3830: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 9s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 53s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 49s | The applied patch generated 1 new checkstyle issues (total was 37, now 31). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 60m 43s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 99m 4s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741828/YARN-3830_2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a815cc1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8344/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8344/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8344/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8344/console | This message was automatically generated. AbstractYarnScheduler.createReleaseCache may try to clean a null attempt Key: YARN-3830 URL: https://issues.apache.org/jira/browse/YARN-3830 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Attachments: YARN-3830_1.patch, YARN-3830_2.patch org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache() {code} protected void createReleaseCache() { // Cleanup the cache after nm expire interval. new Timer().schedule(new TimerTask() { @Override public void run() { for (SchedulerApplicationT app : applications.values()) { T attempt = app.getCurrentAppAttempt(); synchronized (attempt) { for (ContainerId containerId : attempt.getPendingRelease()) { RMAuditLogger.logFailure( {code} Here the attempt can be null since the attempt is created later. 
So a null pointer exception occurs: {code} 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] threw an Exception. | YarnUncaughtExceptionHandler.java:68 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) {code} This also skips the other applications in this run. We can add a null check and continue with the other applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
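An excerpt-style sketch of the suggested null check, applied to the quoted {{createReleaseCache()}} loop (generics restored, unrelated details elided; not the committed patch):
{code}
for (SchedulerApplication<T> app : applications.values()) {
  T attempt = app.getCurrentAppAttempt();
  if (attempt == null) {
    // Attempt not created yet: skip this application instead of letting the
    // timer thread die with a NullPointerException, which also aborts the
    // sweep for every remaining application.
    continue;
  }
  synchronized (attempt) {
    // ... existing pendingRelease cleanup unchanged ...
  }
}
{code}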
[jira] [Created] (YARN-3848) TestNodeLabelContainerAllocation is timing out
Jason Lowe created YARN-3848: Summary: TestNodeLabelContainerAllocation is timing out Key: YARN-3848 URL: https://issues.apache.org/jira/browse/YARN-3848 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Jason Lowe A number of builds, pre-commit and otherwise, have been failing recently because TestNodeLabelContainerAllocation has timed out. See https://builds.apache.org/job/Hadoop-Yarn-trunk/969/, YARN-3830, YARN-3802, or YARN-3826 for examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601070#comment-14601070 ] Hudson commented on YARN-3809: -- FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/969/]) YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.7.1 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all of its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
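A sketch of the general shape of the change: size the launcher pool from configuration instead of hard-coding 10 threads. The property name and default below are assumptions used only for illustration; the committed patch defines its own constant in {{YarnConfiguration}} and {{yarn-default.xml}}.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import org.apache.hadoop.conf.Configuration;

public final class LauncherPoolFactory {
  // Hypothetical key/default, for illustration only.
  static final String THREAD_COUNT_KEY =
      "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_THREAD_COUNT = 50;

  private LauncherPoolFactory() {}

  // Size the AM launcher pool from configuration so a burst of CLEANUP events
  // for a lost NM cannot occupy every thread for the 15-minute RPC timeout
  // and starve LAUNCH events.
  public static ExecutorService createLauncherPool(Configuration conf) {
    int threads = conf.getInt(THREAD_COUNT_KEY, DEFAULT_THREAD_COUNT);
    return Executors.newFixedThreadPool(threads,
        new ThreadFactoryBuilder()
            .setNameFormat("ApplicationMasterLauncher #%d").build());
  }
}
{code}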
[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601072#comment-14601072 ] Hudson commented on YARN-3790: -- FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/969/]) YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52) * hadoop-yarn-project/CHANGES.txt usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601074#comment-14601074 ] Hudson commented on YARN-3360: -- FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/969/]) YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) (jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Fix For: 3.0.0, 2.8.0 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, YARN-3360.003.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
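A minimal sketch of a metrics2 source in the spirit of the {{TimelineDataManagerMetrics}} class this patch adds; the real class covers more operations and may be structured differently.
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableRate;

@Metrics(about = "Timeline data manager metrics", context = "yarn")
public class TimelineDataManagerMetricsSketch {
  @Metric("getEntities calls") MutableCounterLong getEntitiesOps;
  @Metric("getEntities processing time") MutableRate getEntitiesTime;

  public static TimelineDataManagerMetricsSketch create() {
    // Registering with the default metrics system exposes the counters over
    // JMX alongside the existing JVM metrics.
    return DefaultMetricsSystem.instance()
        .register(new TimelineDataManagerMetricsSketch());
  }

  public void incrGetEntitiesOps() {
    getEntitiesOps.incr();
  }

  public void addGetEntitiesTime(long elapsedMs) {
    getEntitiesTime.add(elapsedMs);
  }
}
{code}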
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601073#comment-14601073 ] Hudson commented on YARN-3832: -- FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/969/]) YARN-3832. Resource Localization fails on a cluster due to existing cache directories. Contributed by Brahma Reddy Battula (jlowe: rev 8d58512d6e6d9fe93784a9de2af0056bcc316d96) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-3832 URL: https://issues.apache.org/jira/browse/YARN-3832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Ranga Swamy Assignee: Brahma Reddy Battula Priority: Critical Fix For: 2.7.1 Attachments: YARN-3832.patch *We have found resource localization fails on a cluster with following error.* Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) {noformat} Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt. Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3826: Hadoop Flags: Reviewed Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601051#comment-14601051 ] Hudson commented on YARN-3809: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/]) YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.7.1 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch ApplicationMasterLauncher create a thread pool whose size is 10 to deal with AMLauncherEventType(LAUNCH and CLEANUP). In our cluster, there was many NM with 10+ AM running on it, and one shut down for some reason. After RM found the NM LOST, it cleaned up AMs running on it. Then ApplicationMasterLauncher need handle these 10+ CLEANUP event. ApplicationMasterLauncher's thread pool would be filled up, and they all hang in the code containerMgrProxy.stopContainers(stopRequest) because NM was down, the default RPC time out is 15 mins. It means that in 15 mins ApplicationMasterLauncher could not handle new event such as LAUNCH, then new attempts will fails to launch because of time out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601053#comment-14601053 ] Hudson commented on YARN-3790: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/]) YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52) * hadoop-yarn-project/CHANGES.txt usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601055#comment-14601055 ] Hudson commented on YARN-3360: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/]) YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) (jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Fix For: 3.0.0, 2.8.0 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, YARN-3360.003.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601054#comment-14601054 ] Hudson commented on YARN-3832: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/]) YARN-3832. Resource Localization fails on a cluster due to existing cache directories. Contributed by Brahma Reddy Battula (jlowe: rev 8d58512d6e6d9fe93784a9de2af0056bcc316d96) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-3832 URL: https://issues.apache.org/jira/browse/YARN-3832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Ranga Swamy Assignee: Brahma Reddy Battula Priority: Critical Fix For: 2.7.1 Attachments: YARN-3832.patch *We have found resource localization fails on a cluster with following error.* Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) {noformat} Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt. Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3826: Summary: Race condition in ResourceTrackerService leads to wrong diagnostics messages (was: Race condition in ResourceTrackerService: potential wrong diagnostics messages) Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601032#comment-14601032 ] Chengbing Liu commented on YARN-3826: - Thanks [~devaraj.k] for review and committing! Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601865#comment-14601865 ] Varun Saxena commented on YARN-3793: While the NPEs are a problem, a closer look at the code shows a bigger problem here: *container logs can be lost* if a disk has gone bad (become 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn determines the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in the disk-full case returns nothing, so none of the container logs are aggregated and uploaded. But on application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs. This sounds like a critical issue, if not a blocker. [~kasha], [~jlowe], can you have a look? I will upload a patch shortly. Several NPEs when deleting local files on NM recovery - Key: YARN-3793 URL: https://issues.apache.org/jira/browse/YARN-3793 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean incorrect tracking of these resources and a potential leak. This JIRA is to investigate and fix anything required. Logs show: {noformat} 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
[ https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3830: Attachment: YARN-3830_2.patch Thanks [~xgong] for the comment. Updated the patch Please review AbstractYarnScheduler.createReleaseCache may try to clean a null attempt Key: YARN-3830 URL: https://issues.apache.org/jira/browse/YARN-3830 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Attachments: YARN-3830_1.patch, YARN-3830_2.patch org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache() {code} protected void createReleaseCache() { // Cleanup the cache after nm expire interval. new Timer().schedule(new TimerTask() { @Override public void run() { for (SchedulerApplicationT app : applications.values()) { T attempt = app.getCurrentAppAttempt(); synchronized (attempt) { for (ContainerId containerId : attempt.getPendingRelease()) { RMAuditLogger.logFailure( {code} Here the attempt can be null since the attempt is created later. So null pointer exception will come {code} 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] threw an Exception. | YarnUncaughtExceptionHandler.java:68 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) {code} This will skip the other applications in this run. Can add a null check and continue with other applications -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3846) RM Web UI queue filter not working
[ https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601859#comment-14601859 ] Wangda Tan commented on YARN-3846: -- [~mohdshahidkhan], Could you try https://issues.apache.org/jira/browse/YARN-2238 to see if this problem is already resolved? RM Web UI queue filter not working --- Key: YARN-3846 URL: https://issues.apache.org/jira/browse/YARN-3846 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Mohammad Shahid Khan Assignee: Mohammad Shahid Khan Clicking on the root queue shows all applications, but clicking on a leaf queue does not filter the applications down to the clicked queue. The regular expression seems to be wrong: {code} q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$'; {code} For example: 1. Suppose the queue name is b; then the above expression will substr at index 1, because q.lastIndexOf(':') = -1 and -1 + 2 = 1, which is wrong. It should look at index 0. 2. If the queue name is ab.x, it will be parsed to .x, but it should be x. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
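A plain-Java illustration of the behavior described above (the UI code itself is JavaScript): with no separator present, {{lastIndexOf}} returns -1, so the +2 offset skips the only character of the queue name; a guarded version only strips a prefix when the separator is actually found. The '.'-separated example mirrors the description, and the real fix may differ (see YARN-2238).
{code}
public class QueueFilterRegexDemo {
  // Direct translation of: '^' + q.substr(q.lastIndexOf(':') + 2) + '$'
  static String buggy(String q) {
    return "^" + q.substring(q.lastIndexOf(':') + 2) + "$";
  }

  // Guarded version: only strip a prefix when the separator is present.
  static String fixed(String q) {
    int idx = q.lastIndexOf('.');
    String leaf = (idx == -1) ? q : q.substring(idx + 1);
    return "^" + leaf + "$";
  }

  public static void main(String[] args) {
    System.out.println(buggy("b"));    // "^$"   : -1 + 2 skips the only character
    System.out.println(fixed("b"));    // "^b$"
    System.out.println(fixed("ab.x")); // "^x$"
  }
}
{code}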
[jira] [Commented] (YARN-3508) Preemption processing occurring on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601884#comment-14601884 ] Varun Saxena commented on YARN-3508: [~jlowe]/[~jianhe]/[~leftnoteasy]/[~rohithsharma], can one of the committers have a look at this? :) Preemption processing occurring on the main RM dispatcher Key: YARN-3508 URL: https://issues.apache.org/jira/browse/YARN-3508 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3508.002.patch, YARN-3508.01.patch We recently saw the RM for a large cluster lag far behind on the AsyncDispatcher event queue. The AsyncDispatcher thread was consistently blocked on the highly-contended CapacityScheduler lock trying to dispatch preemption-related events for RMContainerPreemptEventDispatcher. Preemption processing should occur on the scheduler event dispatcher thread or a separate thread to avoid delaying the processing of other events in the primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3508) Preemption processing occurring on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601947#comment-14601947 ] Wangda Tan commented on YARN-3508: -- I tend to support [~jlowe] and [~jianhe]'s suggestion of making preemption events go directly to the scheduler event queue. I think we cannot assume preemption events have higher priority than other events; in most cases, preemption events just notify the AM that something will happen. And managing two queues for the scheduler can be complex (how to balance them, etc.). To reduce complexity, I suggest maintaining only one queue for the scheduler until we have to do otherwise. Preemption processing occurring on the main RM dispatcher Key: YARN-3508 URL: https://issues.apache.org/jira/browse/YARN-3508 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3508.002.patch, YARN-3508.01.patch We recently saw the RM for a large cluster lag far behind on the AsyncDispatcher event queue. The AsyncDispatcher thread was consistently blocked on the highly-contended CapacityScheduler lock trying to dispatch preemption-related events for RMContainerPreemptEventDispatcher. Preemption processing should occur on the scheduler event dispatcher thread or a separate thread to avoid delaying the processing of other events in the primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
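A generic sketch of the pattern being discussed: the handler registered on the main RM dispatcher only forwards preemption events to an asynchronous scheduler dispatcher, so the CapacityScheduler lock is taken on that dispatcher's thread instead of the main {{AsyncDispatcher}} thread. Class name and event types here are placeholders, not the eventual YARN-3508 change.
{code}
import org.apache.hadoop.yarn.event.AsyncDispatcher;
import org.apache.hadoop.yarn.event.Event;
import org.apache.hadoop.yarn.event.EventHandler;

public class ForwardingPreemptionHandler<T extends Event> implements EventHandler<T> {
  private final AsyncDispatcher schedulerDispatcher;

  public ForwardingPreemptionHandler(AsyncDispatcher schedulerDispatcher) {
    this.schedulerDispatcher = schedulerDispatcher;
  }

  @Override
  public void handle(T event) {
    // Enqueue and return immediately; the scheduler dispatcher's own thread
    // processes the event and takes the scheduler lock there, instead of on
    // the main RM dispatcher thread.
    schedulerDispatcher.getEventHandler().handle(event);
  }
}
{code}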
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601956#comment-14601956 ] Jason Lowe commented on YARN-3793: -- It sounds like the NPEs are scary in the logs but benign in practice, since they occur in situations where we don't actually want to delete anything anyway. Regarding loss of logs, I agree with your analysis. Makes me think there should be a getLogDirsForRead that can be used for places to search for files that are already there. The NPE and the log loss are unrelated, so arguably the blocker of log loss should be tracked in a separate JIRA. Several NPEs when deleting local files on NM recovery - Key: YARN-3793 URL: https://issues.apache.org/jira/browse/YARN-3793 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-3793.01.patch When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean incorrect tracking of these resources and a potential leak. This JIRA is to investigate and fix anything required. Logs show: {noformat} 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
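A sketch of the {{getLogDirsForRead}} idea, with illustrative field names only; in {{LocalDirsHandlerService}} the full directories would come from the directory tracking that marks disks over the utilization cutoff.
{code}
import java.util.ArrayList;
import java.util.List;

public class LogDirsExample {
  private final List<String> goodLogDirs = new ArrayList<String>();
  private final List<String> fullLogDirs = new ArrayList<String>();

  // Directories to write new logs into: full disks are excluded.
  public List<String> getLogDirs() {
    return new ArrayList<String>(goodLogDirs);
  }

  // Directories to read existing logs from (e.g. aggregation at app finish):
  // full disks must be included, otherwise their container logs are silently
  // skipped and then removed by the cleanup path.
  public List<String> getLogDirsForRead() {
    List<String> dirs = new ArrayList<String>(goodLogDirs);
    dirs.addAll(fullLogDirs);
    return dirs;
  }
}
{code}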
[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient
[ https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601965#comment-14601965 ] zhihai xu commented on YARN-1965: - Should this be a hadoop common issue? Looks like all the changes are in hadoop common project. Interrupted exception when closing YarnClient - Key: YARN-1965 URL: https://issues.apache.org/jira/browse/YARN-1965 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.3.0 Reporter: Oleg Zhurakousky Assignee: Kuhu Shukla Priority: Minor Labels: newbie Attachments: YARN-1965-v2.patch, YARN-1965.patch Its more of a nuisance then a bug, but nevertheless {code} 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting for clientExecutorto stop java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191) at org.apache.hadoop.ipc.Client.stop(Client.java:1235) at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621) at org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57) at org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) . . . {code} It happens sporadically when stopping YarnClient. Looking at the code in Client's 'unrefAndCleanup' its not immediately obvious why and who throws the interrupt but in any event it should not be logged as ERROR. Probably a WARN with no stack trace. Also, for consistency and correctness you may want to Interrupt current thread as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601974#comment-14601974 ] Varun Saxena commented on YARN-3793: Thanks for looking at this [~jlowe]. I will raise a separate JIRA for this. getLogDirsForRead will be the same as getLogDirsForCleanup, but I guess it would be semantically more correct to use it. Several NPEs when deleting local files on NM recovery - Key: YARN-3793 URL: https://issues.apache.org/jira/browse/YARN-3793 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-3793.01.patch When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean incorrect tracking of these resources and a potential leak. This JIRA is to investigate and fix anything required. Logs show: {noformat} 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3850) Container logs can be lost if disk is full
Varun Saxena created YARN-3850: -- Summary: Container logs can be lost if disk is full Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker *Container logs* can be lost if the disk has become bad (i.e. 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which would return nothing when the disk is full, so none of the container logs are aggregated and uploaded. On application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
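A possible shape for the getLogDirsForRead idea from the YARN-3793 discussion above (a hedged sketch with hypothetical method and field names, not the actual LocalDirsHandlerService API): directories marked full should still be offered when searching for logs that already exist, while write and cleanup behavior stays as it is:
{code}
import java.util.ArrayList;
import java.util.List;

public class LogDirsSketch {
  private final List<String> goodLogDirs = new ArrayList<>(); // below the disk-utilization cutoff
  private final List<String> fullLogDirs = new ArrayList<>(); // marked bad because they are ~90% full

  /** Dirs that are safe to write new logs into. */
  public List<String> getLogDirs() {
    return new ArrayList<>(goodLogDirs);
  }

  /** Dirs eligible for cleanup, including full ones. */
  public List<String> getLogDirsForCleanup() {
    List<String> all = new ArrayList<>(goodLogDirs);
    all.addAll(fullLogDirs);
    return all;
  }

  /**
   * Hypothetical addition: dirs to search for logs that already exist.
   * Full disks must be included here, otherwise aggregation finds nothing
   * while cleanup still deletes the files, losing the container logs.
   */
  public List<String> getLogDirsForRead() {
    return getLogDirsForCleanup();
  }
}
{code}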
[jira] [Commented] (YARN-3611) Support Docker Containers In LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601981#comment-14601981 ] Sidharta Seethana commented on YARN-3611: - [~ashahab] and I have been working together on this for the past few weeks. (We demoed this recently as well). I am going to file sub tasks so that we can make progress. thanks, -Sidharta Support Docker Containers In LinuxContainerExecutor --- Key: YARN-3611 URL: https://issues.apache.org/jira/browse/YARN-3611 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sidharta Seethana Assignee: Sidharta Seethana Support Docker Containers In LinuxContainerExecutor LinuxContainerExecutor provides useful functionality today with respect to localization, cgroups based resource management and isolation for CPU, network, disk etc. as well as security with a well-defined mechanism to execute privileged operations using the container-executor utility. Bringing docker support to LinuxContainerExecutor lets us use all of this functionality when running docker containers under YARN, while not requiring users and admins to configure and use a different ContainerExecutor. There are several aspects here that need to be worked through : * Mechanism(s) to let clients request docker-specific functionality - we could initially implement this via environment variables without impacting the client API. * Security - both docker daemon as well as application * Docker image localization * Running a docker container via container-executor as a specified user * “Isolate” the docker container in terms of CPU/network/disk/etc * Communicating with and/or signaling the running container (ensure correct pid handling) * Figure out workarounds for certain performance-sensitive scenarios like HDFS short-circuit reads * All of these need to be achieved without changing the current behavior of LinuxContainerExecutor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
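For the first bullet in the description above (requesting docker-specific behavior through environment variables rather than a client API change), here is a minimal sketch of what a client or AM might put into the container environment; the variable names are purely illustrative, not an agreed-upon contract:
{code}
import java.util.HashMap;
import java.util.Map;

public class DockerEnvRequestSketch {
  public static Map<String, String> dockerContainerEnv(String image) {
    Map<String, String> env = new HashMap<>();
    // Hypothetical variable names: the executor would look for these and, if present,
    // launch the container through docker instead of the default launch path.
    env.put("YARN_CONTAINER_RUNTIME", "docker");
    env.put("YARN_CONTAINER_DOCKER_IMAGE", image);
    return env;
  }

  public static void main(String[] args) {
    // These values would be merged into the ContainerLaunchContext environment by the client.
    System.out.println(dockerContainerEnv("library/centos:7"));
  }
}
{code}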
[jira] [Updated] (YARN-1965) Interrupted exception when closing YarnClient
[ https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-1965: -- Attachment: YARN-1965-v2.patch Patch with correction for whitespace. Fix to log Interrupted Exception in IPC Client as a warning. The current thread is interrupted once the Exception is caught. Also, some cleanup code in TestIPC is added so that the client executor count is decremented after each test. Interrupted exception when closing YarnClient - Key: YARN-1965 URL: https://issues.apache.org/jira/browse/YARN-1965 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.3.0 Reporter: Oleg Zhurakousky Assignee: Kuhu Shukla Priority: Minor Labels: newbie Attachments: YARN-1965-v2.patch, YARN-1965.patch Its more of a nuisance then a bug, but nevertheless {code} 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting for clientExecutorto stop java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191) at org.apache.hadoop.ipc.Client.stop(Client.java:1235) at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621) at org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57) at org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) . . . {code} It happens sporadically when stopping YarnClient. Looking at the code in Client's 'unrefAndCleanup' its not immediately obvious why and who throws the interrupt but in any event it should not be logged as ERROR. Probably a WARN with no stack trace. Also, for consistency and correctness you may want to Interrupt current thread as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
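The patch description above amounts to the following pattern (a standalone sketch around a generic ExecutorService, not the actual org.apache.hadoop.ipc.Client code): log the interruption as a warning without a stack trace and restore the thread's interrupt status:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class ExecutorShutdownSketch {
  private static final Logger LOG = Logger.getLogger(ExecutorShutdownSketch.class.getName());

  static void stopQuietly(ExecutorService clientExecutor) {
    clientExecutor.shutdown();
    try {
      if (!clientExecutor.awaitTermination(1, TimeUnit.MINUTES)) {
        clientExecutor.shutdownNow();
      }
    } catch (InterruptedException e) {
      // A warning with no stack trace, not an ERROR: being interrupted while
      // waiting for shutdown is expected during client teardown.
      LOG.warning("Interrupted while waiting for clientExecutor to stop");
      clientExecutor.shutdownNow();
      // Preserve the interrupt for callers further up the stack.
      Thread.currentThread().interrupt();
    }
  }
}
{code}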
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601413#comment-14601413 ] Hudson commented on YARN-3745: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException, it tries to create the internal exception in instantiateException() with cn = cls.getConstructor(String.class). If cls does not have a constructor with a String parameter, it throws NoSuchMethodException, for example for the ClosedChannelException class. We should also try to instantiate the exception with the default constructor so that the inner exception can be propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
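The committed fix boils down to a reflection fallback along these lines (a simplified sketch of the idea, not the exact SerializedExceptionPBImpl code): try the (String) constructor first and, if it does not exist, fall back to the default constructor and attach the cause:
{code}
public class ExceptionInstantiationSketch {
  static Throwable instantiate(Class<? extends Throwable> cls, String message, Throwable cause)
      throws ReflectiveOperationException {
    Throwable t;
    try {
      // Preferred: most exceptions expose a (String) constructor.
      t = cls.getConstructor(String.class).newInstance(message);
    } catch (NoSuchMethodException e) {
      // Fallback for classes like java.nio.channels.ClosedChannelException
      // that only have a default constructor.
      t = cls.getConstructor().newInstance();
    }
    if (cause != null) {
      t.initCause(cause);
    }
    return t;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(instantiate(java.nio.channels.ClosedChannelException.class, "ignored", null));
  }
}
{code}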
[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework
[ https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3827: Attachment: YARN-3827.001.patch Migrate YARN native build to new CMake framework Key: YARN-3827 URL: https://issues.apache.org/jira/browse/YARN-3827 Project: Hadoop YARN Issue Type: Sub-task Components: build Affects Versions: 2.7.0 Reporter: Alan Burlison Assignee: Alan Burlison Attachments: YARN-3827.001.patch As per HADOOP-12036, the CMake infrastructure should be refactored and made common across all Hadoop components. This bug covers the migration of YARN to the new CMake infrastructure. This change will also add support for building YARN Native components on Solaris. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601369#comment-14601369 ] Hudson commented on YARN-3809: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/]) YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.7.1 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one shut down for some reason. After the RM found the NM LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and the threads all hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 mins. This means that for 15 mins ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
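The files touched above (YarnConfiguration and yarn-default.xml) suggest the fix makes the launcher pool size configurable. As a rough standalone sketch of that idea, with the property name and default treated as assumptions rather than the exact values in the patch:
{code}
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AmLauncherPoolSketch {
  // Assumed property name and default, used only for this sketch.
  static final String THREAD_COUNT_KEY = "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_THREAD_COUNT = 50;

  static ExecutorService createLauncherPool(Properties conf) {
    int threads = Integer.parseInt(
        conf.getProperty(THREAD_COUNT_KEY, String.valueOf(DEFAULT_THREAD_COUNT)));
    // A larger, configurable pool means a batch of slow CLEANUP calls against a dead NM
    // cannot starve LAUNCH events for the full RPC timeout.
    return Executors.newFixedThreadPool(threads);
  }
}
{code}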
[jira] [Updated] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
[ https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3830: Attachment: YARN-3830_3.patch Sorry for the small mistake. The line limit is corrected. The test failure is not related to this patch; verified locally, it passes. AbstractYarnScheduler.createReleaseCache may try to clean a null attempt Key: YARN-3830 URL: https://issues.apache.org/jira/browse/YARN-3830 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Attachments: YARN-3830_1.patch, YARN-3830_2.patch, YARN-3830_3.patch org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache() {code} protected void createReleaseCache() { // Cleanup the cache after nm expire interval. new Timer().schedule(new TimerTask() { @Override public void run() { for (SchedulerApplication<T> app : applications.values()) { T attempt = app.getCurrentAppAttempt(); synchronized (attempt) { for (ContainerId containerId : attempt.getPendingRelease()) { RMAuditLogger.logFailure( {code} Here the attempt can be null since the attempt is created later, so a NullPointerException can occur: {code} 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] threw an Exception. | YarnUncaughtExceptionHandler.java:68 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) {code} This will skip the other applications in this run. We can add a null check and continue with the other applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
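The null check suggested in the description could look roughly like this (a sketch using placeholder types; the real loop lives in AbstractYarnScheduler.createReleaseCache()): skip applications whose current attempt does not exist yet instead of letting the timer task die on an NPE:
{code}
import java.util.Map;
import java.util.TimerTask;
import java.util.logging.Logger;

public class ReleaseCacheCleanupSketch<A> extends TimerTask {
  interface App<A> { A getCurrentAppAttempt(); }

  private static final Logger LOG = Logger.getLogger(ReleaseCacheCleanupSketch.class.getName());
  private final Map<String, App<A>> applications;

  ReleaseCacheCleanupSketch(Map<String, App<A>> applications) {
    this.applications = applications;
  }

  @Override
  public void run() {
    for (App<A> app : applications.values()) {
      A attempt = app.getCurrentAppAttempt();
      if (attempt == null) {
        // The attempt may not be created yet; skip it and keep cleaning the remaining apps
        // instead of throwing an NPE that aborts the whole timer run.
        LOG.fine("Skipping application with no current attempt");
        continue;
      }
      synchronized (attempt) {
        // ... clean up the pending-release containers for this attempt ...
      }
    }
  }
}
{code}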
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601412#comment-14601412 ] Hudson commented on YARN-3832: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/]) YARN-3832. Resource Localization fails on a cluster due to existing cache directories. Contributed by Brahma Reddy Battula (jlowe: rev 8d58512d6e6d9fe93784a9de2af0056bcc316d96) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-3832 URL: https://issues.apache.org/jira/browse/YARN-3832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Ranga Swamy Assignee: Brahma Reddy Battula Priority: Critical Fix For: 2.7.1 Attachments: YARN-3832.patch *We have found resource localization fails on a cluster with following error.* Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) {noformat} Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt. Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
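The failure above comes from renaming a freshly downloaded resource onto a destination directory left over from an earlier attempt. One way to make that step robust (a standalone java.nio sketch of the general idea, not the actual FSDownload/ResourceLocalizationService change) is to remove any stale destination before the rename:
{code}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class OverwriteRenameSketch {
  /** Delete any stale destination left by an earlier attempt, then move the new dir into place. */
  static void moveReplacing(Path downloaded, Path destination) throws IOException {
    if (Files.exists(destination)) {
      try (Stream<Path> walk = Files.walk(destination)) {
        // Depth-first delete: children before parents.
        walk.sorted(Comparator.reverseOrder()).forEach(p -> {
          try {
            Files.delete(p);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
      }
    }
    Files.move(downloaded, destination);
  }
}
{code}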
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601407#comment-14601407 ] Hudson commented on YARN-3826: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
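The race described above comes from mutating diagnostics on shared, static response objects from concurrent heartbeats. The straightforward remedy (a simplified standalone sketch, not the ResourceTrackerService code itself) is to build a fresh response per heartbeat so each caller carries its own diagnostics message:
{code}
public class HeartbeatResponseSketch {
  enum NodeAction { NORMAL, RESYNC, SHUTDOWN }

  static class NodeHeartbeatResponse {
    private final NodeAction action;
    private final String diagnostics;
    NodeHeartbeatResponse(NodeAction action, String diagnostics) {
      this.action = action;
      this.diagnostics = diagnostics;
    }
    @Override public String toString() {
      return action + ": " + diagnostics;
    }
  }

  // Built per heartbeat: no shared static instance, so concurrent calls to
  // nodeHeartbeat() cannot overwrite each other's diagnostics message.
  static NodeHeartbeatResponse shutdown(String nodeId, String reason) {
    return new NodeHeartbeatResponse(NodeAction.SHUTDOWN,
        "Disallowed NodeManager " + nodeId + ": " + reason);
  }

  static NodeHeartbeatResponse resync(String nodeId, String reason) {
    return new NodeHeartbeatResponse(NodeAction.RESYNC,
        "Node " + nodeId + " asked to resync: " + reason);
  }
}
{code}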
[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601411#comment-14601411 ] Hudson commented on YARN-3790: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/]) YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52) * hadoop-yarn-project/CHANGES.txt usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601414#comment-14601414 ] Hudson commented on YARN-3360: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/]) YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) (jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Fix For: 3.0.0, 2.8.0 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, YARN-3360.003.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
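As a rough, self-contained illustration of the kind of counters involved (method-call counts and accumulated time), here is a plain-JMX sketch using only the JDK MBean API; the real patch adds a TimelineDataManagerMetrics class, and the metric names below are only placeholders:
{code}
import java.lang.management.ManagementFactory;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.ObjectName;

public class TimelineMetricsSketch {

  /** Management interface; must be public and named <impl>MBean for a standard MBean. */
  public interface CallMetricsMBean {
    long getGetEntitiesOps();
    long getGetEntitiesTimeMs();
  }

  /** Counts invocations and accumulated wall-clock time of a "getEntities"-style call. */
  public static class CallMetrics implements CallMetricsMBean {
    private final AtomicLong ops = new AtomicLong();
    private final AtomicLong timeMs = new AtomicLong();

    @Override public long getGetEntitiesOps() { return ops.get(); }
    @Override public long getGetEntitiesTimeMs() { return timeMs.get(); }

    public <T> T timed(Callable<T> call) throws Exception {
      long start = System.currentTimeMillis();
      try {
        return call.call();
      } finally {
        ops.incrementAndGet();
        timeMs.addAndGet(System.currentTimeMillis() - start);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    CallMetrics metrics = new CallMetrics();
    ManagementFactory.getPlatformMBeanServer()
        .registerMBean(metrics, new ObjectName("sketch:type=TimelineCallMetrics"));
    metrics.timed(() -> "dummy getEntities call");
    System.out.println(metrics.getGetEntitiesOps() + " ops");
  }
}
{code}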
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601409#comment-14601409 ] Hudson commented on YARN-3809: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/]) YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.7.1 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch ApplicationMasterLauncher create a thread pool whose size is 10 to deal with AMLauncherEventType(LAUNCH and CLEANUP). In our cluster, there was many NM with 10+ AM running on it, and one shut down for some reason. After RM found the NM LOST, it cleaned up AMs running on it. Then ApplicationMasterLauncher need handle these 10+ CLEANUP event. ApplicationMasterLauncher's thread pool would be filled up, and they all hang in the code containerMgrProxy.stopContainers(stopRequest) because NM was down, the default RPC time out is 15 mins. It means that in 15 mins ApplicationMasterLauncher could not handle new event such as LAUNCH, then new attempts will fails to launch because of time out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework
[ https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3827: Attachment: (was: YARN-3827.001.patch) Migrate YARN native build to new CMake framework Key: YARN-3827 URL: https://issues.apache.org/jira/browse/YARN-3827 Project: Hadoop YARN Issue Type: Sub-task Components: build Affects Versions: 2.7.0 Reporter: Alan Burlison Assignee: Alan Burlison As per HADOOP-12036, the CMake infrastructure should be refactored and made common across all Hadoop components. This bug covers the migration of YARN to the new CMake infrastructure. This change will also add support for building YARN Native components on Solaris. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601373#comment-14601373 ] Hudson commented on YARN-3745: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException it tries to create internal exception in instantiateException() with cn = cls.getConstructor(String.class). if cls does not has a constructor with String parameter it throws Nosuchmethodexception for example ClosedChannelException class. We should also try to instantiate exception with default constructor so that inner exception can to propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601367#comment-14601367 ] Hudson commented on YARN-3826: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java * hadoop-yarn-project/CHANGES.txt Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601374#comment-14601374 ] Hudson commented on YARN-3360: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/]) YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) (jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/CHANGES.txt Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Fix For: 3.0.0, 2.8.0 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, YARN-3360.003.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601372#comment-14601372 ] Hudson commented on YARN-3832: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/]) YARN-3832. Resource Localization fails on a cluster due to existing cache directories. Contributed by Brahma Reddy Battula (jlowe: rev 8d58512d6e6d9fe93784a9de2af0056bcc316d96) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-3832 URL: https://issues.apache.org/jira/browse/YARN-3832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Ranga Swamy Assignee: Brahma Reddy Battula Priority: Critical Fix For: 2.7.1 Attachments: YARN-3832.patch *We have found resource localization fails on a cluster with following error.* Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) {noformat} Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt. Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601371#comment-14601371 ] Hudson commented on YARN-3790: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/]) YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52) * hadoop-yarn-project/CHANGES.txt usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Yaniv updated YARN-3656: - Attachment: YARN-3656-v1.2.patch LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Ishai Menache Assignee: Jonathan Yaniv Labels: capacity-scheduler, resourcemanager Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
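To make the load-based spreading idea concrete, here is a tiny standalone sketch (my own simplification for illustration, not the LowCost algorithm from the attached patch or paper): each unit of the job's demand is placed in the currently cheapest time step inside the allowed window, where cost is approximated by current load:
{code}
import java.util.Arrays;

public class SpreadDemandSketch {
  /**
   * Places 'demand' unit-sized allocations into the window [start, end),
   * always picking the step with the lowest current load (the cost proxy).
   * Returns the extra allocation per step.
   */
  static int[] spread(double[] load, int start, int end, int demand) {
    double[] current = Arrays.copyOf(load, load.length);
    int[] placed = new int[load.length];
    for (int unit = 0; unit < demand; unit++) {
      int cheapest = start;
      for (int t = start; t < end; t++) {
        if (current[t] < current[cheapest]) {
          cheapest = t;
        }
      }
      placed[cheapest]++;
      current[cheapest] += 1.0; // load-based cost grows as a step fills up
    }
    return placed;
  }

  public static void main(String[] args) {
    double[] load = {5, 1, 3, 0, 4};
    System.out.println(Arrays.toString(spread(load, 0, load.length, 6)));
  }
}
{code}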
[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient
[ https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601394#comment-14601394 ] Hadoop QA commented on YARN-1965: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 4s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 50s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 0s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 8s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 51s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 22m 21s | Tests passed in hadoop-common. | | | | 62m 51s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741865/YARN-1965-v2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / b381f88 | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8345/artifact/patchprocess/testrun_hadoop-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8345/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8345/console | This message was automatically generated. 
Interrupted exception when closing YarnClient - Key: YARN-1965 URL: https://issues.apache.org/jira/browse/YARN-1965 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.3.0 Reporter: Oleg Zhurakousky Assignee: Kuhu Shukla Priority: Minor Labels: newbie Attachments: YARN-1965-v2.patch, YARN-1965.patch Its more of a nuisance then a bug, but nevertheless {code} 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting for clientExecutorto stop java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191) at org.apache.hadoop.ipc.Client.stop(Client.java:1235) at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621) at org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57) at org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) . . . {code} It happens sporadically when stopping YarnClient. Looking at the code in Client's 'unrefAndCleanup' its not immediately obvious why and who throws the interrupt but in any event it should not be logged as ERROR. Probably a WARN with no stack trace. Also, for consistency and correctness you may want to Interrupt current thread as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2004: -- Attachment: 0007-YARN-2004.patch Rebasing patch against latest trunk. Also made changes as per OrderingPolicy in CS. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch, 0007-YARN-2004.patch Based on the priority of the application, Capacity Scheduler should be able to give preference to applications while doing scheduling. Comparator<FiCaSchedulerApp> applicationComparator can be changed as below. 1. Check the application priority; if a priority is available, return the higher-priority job first. 2. Otherwise continue with the existing logic such as App ID comparison and then TimeStamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
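The two-step comparison described in this issue (priority first, then the existing app-id/timestamp ordering) can be sketched as a plain Comparator; the field names and priority semantics here are placeholders, not the actual FiCaSchedulerApp API:
{code}
import java.util.Comparator;

public class PriorityOrderingSketch {
  static class AppInfo {
    final int priority;        // higher value = more important (placeholder semantics)
    final long applicationId;  // monotonically increasing submission id
    final long submitTime;
    AppInfo(int priority, long applicationId, long submitTime) {
      this.priority = priority;
      this.applicationId = applicationId;
      this.submitTime = submitTime;
    }
  }

  // 1. Highest priority first; 2. fall back to app id; 3. then submission timestamp.
  static final Comparator<AppInfo> APPLICATION_COMPARATOR =
      Comparator.comparingInt((AppInfo a) -> a.priority).reversed()
          .thenComparingLong(a -> a.applicationId)
          .thenComparingLong(a -> a.submitTime);
}
{code}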
[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework
[ https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3827: Attachment: YARN-3827.001.patch Migrate YARN native build to new CMake framework Key: YARN-3827 URL: https://issues.apache.org/jira/browse/YARN-3827 Project: Hadoop YARN Issue Type: Sub-task Components: build Affects Versions: 2.7.0 Reporter: Alan Burlison Assignee: Alan Burlison As per HADOOP-12036, the CMake infrastructure should be refactored and made common across all Hadoop components. This bug covers the migration of YARN to the new CMake infrastructure. This change will also add support for building YARN Native components on Solaris. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework
[ https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3827: Attachment: (was: YARN-3827.001.patch) Migrate YARN native build to new CMake framework Key: YARN-3827 URL: https://issues.apache.org/jira/browse/YARN-3827 Project: Hadoop YARN Issue Type: Sub-task Components: build Affects Versions: 2.7.0 Reporter: Alan Burlison Assignee: Alan Burlison As per HADOOP-12036, the CMake infrastructure should be refactored and made common across all Hadoop components. This bug covers the migration of YARN to the new CMake infrastructure. This change will also add support for building YARN Native components on Solaris. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601447#comment-14601447 ] Hudson commented on YARN-3790: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/]) YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52) * hadoop-yarn-project/CHANGES.txt usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601448#comment-14601448 ] Hudson commented on YARN-3832: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/]) YARN-3832. Resource Localization fails on a cluster due to existing cache directories. Contributed by Brahma Reddy Battula (jlowe: rev 8d58512d6e6d9fe93784a9de2af0056bcc316d96) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-3832 URL: https://issues.apache.org/jira/browse/YARN-3832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Ranga Swamy Assignee: Brahma Reddy Battula Priority: Critical Fix For: 2.7.1 Attachments: YARN-3832.patch *We have found resource localization fails on a cluster with following error.* Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) {noformat} Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt. Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601443#comment-14601443 ] Hudson commented on YARN-3826: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/CHANGES.txt Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601450#comment-14601450 ] Hudson commented on YARN-3360: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/]) YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) (jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Fix For: 3.0.0, 2.8.0 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, YARN-3360.003.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601472#comment-14601472 ] Hadoop QA commented on YARN-3656: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 14m 56s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 32s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 24s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 50s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 87m 16s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741868/YARN-3656-v1.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / b381f88 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8346/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8346/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8346/console | This message was automatically generated. LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Ishai Menache Assignee: Jonathan Yaniv Labels: capacity-scheduler, resourcemanager Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601479#comment-14601479 ] Ming Ma commented on YARN-221: -- Here is the scenario. a) no applications want to override the default. b) Administrators of the cluster want to make a cluster-wide global change from a sample rate of 20 percent to 50 percent. NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601616#comment-14601616 ] Xuan Gong commented on YARN-221: bq. Here is the scenario. a) no applications want to override the default. b) Administrators of the cluster want to make a cluster-wide global change from a sample rate of 20 percent to 50 percent. OK. This makes sense. Thanks for the explanation. NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3724) Use POSIX nftw(3) instead of fts(3)
[ https://issues.apache.org/jira/browse/YARN-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601434#comment-14601434 ] Alan Burlison commented on YARN-3724: - See also the discussion in http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201506.mbox/%3C558BCA3A.1020602%40oracle.com%3E. The use of fts(3) should be replaced by nftw(3) Use POSIX nftw(3) instead of fts(3) --- Key: YARN-3724 URL: https://issues.apache.org/jira/browse/YARN-3724 Project: Hadoop YARN Issue Type: Sub-task Environment: Solaris 11.2 Reporter: Malcolm Kavalsky Assignee: Alan Burlison Original Estimate: 24h Remaining Estimate: 24h Compiling the Yarn Node Manager results in fts not found. On Solaris we have an alternative ftw with similar functionality. This is isolated to a single file container-executor.c Note that this will just fix the compilation error. A more serious issue is that Solaris does not support cgroups as Linux does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601475#comment-14601475 ] Sunil G commented on YARN-3849: --- Looping [~rohithsharma] and [~leftnoteasy] Since we use the Dominant Resource Calculator, the below piece of code in ProportionalCapacityPreemptionPolicy looks doubtful {code} // When we have no more resource need to obtain, remove from map. if (Resources.lessThanOrEqual(rc, clusterResource, toObtainByPartition, Resources.none())) { resourceToObtainByPartitions.remove(nodePartition); } {code} Assume toObtainByPartition is 12, 1 (memory, cores). After another round of preemption, this will become 10, 0. If the above check hits with this value, it's supposed to return TRUE. But the method returns FALSE. The reason is that, due to dominance, if any resource item is non-zero then that is returned as true. {code} // Just use 'dominant' resource return (dominant) ? Math.max( (float)resource.getMemory() / clusterResource.getMemory(), (float)resource.getVirtualCores() / clusterResource.getVirtualCores() ) : Math.min( (float)resource.getMemory() / clusterResource.getMemory(), (float)resource.getVirtualCores() / clusterResource.getVirtualCores() ); {code} If resource.getVirtualCores() is ZERO and resource.getMemory() is Non-Zero, then this check will return +ve. We feel that this has to be checked beforehand, and if one item is ZERO, we have to treat lhs as less than or equal to rhs. Too much of preemption activity causing continuos killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue has given a capacity of 0.5. Dominant Resource policy is used. 1. An app is submitted in QueueA which is consuming full cluster capacity 2. After submitting an app in QueueB, there are some demand and invoking preemption in QueueA 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that all containers other than AM is getting killed in QueueA 4. Now the app in QueueB is trying to take over cluster with the current free space. But there are some updated demand from the app in QueueA which lost its containers earlier, and preemption is kicked in QueueB now. Scenario in step 3 and 4 continuously happening in loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
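To make the guard Sunil G describes concrete, here is a minimal sketch (illustrative only, not the attached patches): once either dimension of toObtainByPartition reaches zero, the partition is treated as satisfied instead of relying on the dominant-share comparison, which stays positive as long as any single dimension is non-zero. The helper name toObtainSatisfied is an assumption.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class PreemptionGuardSketch {
  // Returns true when either dimension has been fully obtained, so the
  // partition could be removed from resourceToObtainByPartitions.
  static boolean toObtainSatisfied(Resource toObtain) {
    return toObtain.getMemory() <= 0 || toObtain.getVirtualCores() <= 0;
  }

  public static void main(String[] args) {
    // 10 GB of memory still to obtain, but no cores left: the dominant-share
    // comparison against Resources.none() keeps reporting "greater", while
    // this guard reports the partition as satisfied.
    Resource toObtain = Resource.newInstance(10 * 1024, 0);
    System.out.println("satisfied = " + toObtainSatisfied(toObtain));
  }
}
{code}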
[jira] [Updated] (YARN-3838) Rest API failing when ip configured in RM address in secure https mode
[ https://issues.apache.org/jira/browse/YARN-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3838: --- Attachment: 0002-YARN-3838.patch Updated patch since new util is not required to be added. Rest API failing when ip configured in RM address in secure https mode -- Key: YARN-3838 URL: https://issues.apache.org/jira/browse/YARN-3838 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-HADOOP-12096.patch, 0001-YARN-3810.patch, 0001-YARN-3838.patch, 0002-YARN-3810.patch, 0002-YARN-3838.patch Steps to reproduce === 1.Configure hadoop.http.authentication.kerberos.principal as below {code:xml} property namehadoop.http.authentication.kerberos.principal/name valueHTTP/_h...@hadoop.com/value /property {code} 2. In RM web address also configure IP 3. Startup RM Call Rest API for RM {{ curl -i -k --insecure --negotiate -u : https IP /ws/v1/cluster/info}} *Actual* Rest API failing {code} 2015-06-16 19:03:49,845 DEBUG org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails) org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails) at org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519) at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
Sunil G created YARN-3849: - Summary: Too much of preemption activity causing continuos killing of containers across queues Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue has given a capacity of 0.5. Dominant Resource policy is used. 1. An app is submitted in QueueA which is consuming full cluster capacity 2. After submitting an app in QueueB, there are some demand and invoking preemption in QueueA 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that all containers other than AM is getting killed in QueueA 4. Now the app in QueueB is trying to take over cluster with the current free space. But there are some updated demand from the app in QueueA which lost its containers earlier, and preemption is kicked in QueueB now. Scenario in step 3 and 4 continuously happening in loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601461#comment-14601461 ] Hudson commented on YARN-3826: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java * hadoop-yarn-project/CHANGES.txt Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other side, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601463#comment-14601463 ] Hudson commented on YARN-3809: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/]) YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.7.1 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with AMLauncherEventType(LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one shut down for some reason. After RM found the NM LOST, it cleaned up AMs running on it. Then ApplicationMasterLauncher needs to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool would be filled up, and they all hang in the call containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC timeout is 15 mins. It means that for 15 mins ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts will fail to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
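A hedged sketch of the general fix direction implied by the commit (the changes touch YarnConfiguration and yarn-default.xml): make the hard-coded launcher pool size of 10 configurable so CLEANUP calls blocked on an unreachable NM do not starve LAUNCH events. The configuration key and default below are illustrative assumptions, not necessarily what the committed patch uses.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;

public class LauncherPoolSketch {
  // Illustrative key/default; the actual property added by YARN-3809 may differ.
  static final String THREAD_COUNT_KEY = "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_THREAD_COUNT = 50;

  static ExecutorService createLauncherPool(Configuration conf) {
    int threads = conf.getInt(THREAD_COUNT_KEY, DEFAULT_THREAD_COUNT);
    // A larger, configurable pool keeps LAUNCH events flowing even when many
    // CLEANUP calls are waiting out the RPC timeout against a dead NM.
    return Executors.newFixedThreadPool(threads);
  }
}
{code}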
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601467#comment-14601467 ] Hudson commented on YARN-3745: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * hadoop-yarn-project/CHANGES.txt SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException, it tries to create the internal exception in instantiateException() with cn = cls.getConstructor(String.class). If cls does not have a constructor with a String parameter, it throws NoSuchMethodException, for example for the ClosedChannelException class. We should also try to instantiate the exception with the default constructor so that the inner exception can be propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
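A minimal sketch of the fallback described in the summary: try the (String) constructor first and, if the wrapped class (for example ClosedChannelException) does not declare one, fall back to the no-arg constructor so the inner exception can still be propagated. The method name and structure are illustrative, not the committed SerializedExceptionPBImpl code.
{code}
import java.lang.reflect.Constructor;

public class InstantiateExceptionSketch {
  static Throwable instantiate(Class<? extends Throwable> cls, String message, Throwable cause)
      throws Exception {
    Throwable t;
    try {
      Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
      t = cn.newInstance(message);
    } catch (NoSuchMethodException e) {
      // Fallback for classes like ClosedChannelException that only expose a
      // default constructor.
      Constructor<? extends Throwable> cn = cls.getConstructor();
      t = cn.newInstance();
    }
    if (cause != null) {
      t.initCause(cause);
    }
    return t;
  }
}
{code}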
[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601466#comment-14601466 ] Hudson commented on YARN-3832: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/]) YARN-3832. Resource Localization fails on a cluster due to existing cache directories. Contributed by Brahma Reddy Battula (jlowe: rev 8d58512d6e6d9fe93784a9de2af0056bcc316d96) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-3832 URL: https://issues.apache.org/jira/browse/YARN-3832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Ranga Swamy Assignee: Brahma Reddy Battula Priority: Critical Fix For: 2.7.1 Attachments: YARN-3832.patch *We have found resource localization fails on a cluster with following error.* Got this error in hadoop-2.7.0 release which was fixed in 2.6.0 (YARN-2624) {noformat} Application application_1434703279149_0057 failed 2 times due to AM Container for appattempt_1434703279149_0057_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then, click on links to logs of each attempt. Diagnostics: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 java.io.IOException: Rename cannot overwrite non empty destination directory /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
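Sketched below is one possible way to avoid the failure mode in the trace above: remove a stale destination directory before renaming the freshly downloaded resource into place. This is illustrative only and is not claimed to be what the committed YARN-3832 patch does; the method name is an assumption.
{code}
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class LocalizerRenameSketch {
  static void moveIntoPlace(FileContext lfs, Path downloaded, Path destination)
      throws Exception {
    if (lfs.util().exists(destination)) {
      // A directory left behind by a previous NM run would make a plain
      // rename fail with "Rename cannot overwrite non empty destination".
      lfs.delete(destination, true);
    }
    lfs.rename(downloaded, destination, Options.Rename.NONE);
  }
}
{code}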
[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601465#comment-14601465 ] Hudson commented on YARN-3790: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/]) YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52) * hadoop-yarn-project/CHANGES.txt usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601468#comment-14601468 ] Hudson commented on YARN-3360: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/]) YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) (jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Fix For: 3.0.0, 2.8.0 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, YARN-3360.003.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3724) Use POSIX nftw(3) instead of fts(3)
[ https://issues.apache.org/jira/browse/YARN-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison updated YARN-3724: Summary: Use POSIX nftw(3) instead of fts(3) (was: Native compilation on Solaris fails on Yarn due to use of FTS) Use POSIX nftw(3) instead of fts(3) --- Key: YARN-3724 URL: https://issues.apache.org/jira/browse/YARN-3724 Project: Hadoop YARN Issue Type: Sub-task Environment: Solaris 11.2 Reporter: Malcolm Kavalsky Assignee: Alan Burlison Original Estimate: 24h Remaining Estimate: 24h Compiling the Yarn Node Manager results in fts not found. On Solaris we have an alternative ftw with similar functionality. This is isolated to a single file container-executor.c Note that this will just fix the compilation error. A more serious issue is that Solaris does not support cgroups as Linux does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601445#comment-14601445 ] Hudson commented on YARN-3809: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/]) YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang. Contributed by Jun Gong (jlowe: rev 2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.7.1 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with AMLauncherEventType(LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one shut down for some reason. After RM found the NM LOST, it cleaned up AMs running on it. Then ApplicationMasterLauncher needs to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool would be filled up, and they all hang in the call containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC timeout is 15 mins. It means that for 15 mins ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts will fail to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601449#comment-14601449 ] Hudson commented on YARN-3745: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException, it tries to create the internal exception in instantiateException() with cn = cls.getConstructor(String.class). If cls does not have a constructor with a String parameter, it throws NoSuchMethodException, for example for the ClosedChannelException class. We should also try to instantiate the exception with the default constructor so that the inner exception can be propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3409) Add constraint node labels
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601495#comment-14601495 ] chong chen commented on YARN-3409: -- Any update on this? Add constraint node labels -- Key: YARN-3409 URL: https://issues.apache.org/jira/browse/YARN-3409 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, client Reporter: Wangda Tan Assignee: Wangda Tan Specifying only one label for each node (IAW, partitioning a cluster) is a way to determine how the resources of a special set of nodes could be shared by a group of entities (like teams, departments, etc.). Partitions of a cluster have the following characteristics: - The cluster is divided into several disjoint sub-clusters. - ACL/priority can apply on a partition (Only market team / marke team has priority to use the partition). - Percentage of capacities can apply on a partition (Market team has 40% minimum capacity and Dev team has 60% of minimum capacity of the partition). Constraints are orthogonal to partitions; they describe attributes of a node's hardware/software just for affinity. Some examples of constraints: - glibc version - JDK version - Type of CPU (x86_64/i686) - Type of OS (windows, linux, etc.) With this, an application can ask for a resource that has (glibc.version = 2.20 JDK.version = 8u20 x86_64). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3838) Rest API failing when ip configured in RM address in secure https mode
[ https://issues.apache.org/jira/browse/YARN-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601565#comment-14601565 ] Hadoop QA commented on YARN-3838: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 7s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 54s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 34s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | | | 40m 19s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741889/0002-YARN-3838.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bc43390 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8348/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8348/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8348/console | This message was automatically generated. Rest API failing when ip configured in RM address in secure https mode -- Key: YARN-3838 URL: https://issues.apache.org/jira/browse/YARN-3838 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-HADOOP-12096.patch, 0001-YARN-3810.patch, 0001-YARN-3838.patch, 0002-YARN-3810.patch, 0002-YARN-3838.patch Steps to reproduce === 1.Configure hadoop.http.authentication.kerberos.principal as below {code:xml} property namehadoop.http.authentication.kerberos.principal/name valueHTTP/_h...@hadoop.com/value /property {code} 2. In RM web address also configure IP 3. 
Startup RM Call Rest API for RM {{ curl -i -k --insecure --negotiate -u : https IP /ws/v1/cluster/info}} *Actual* Rest API failing {code} 2015-06-16 19:03:49,845 DEBUG org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails) org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails) at org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519) at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
[ https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601538#comment-14601538 ] Hadoop QA commented on YARN-3830: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 58s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 45s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 49s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 36s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741871/YARN-3830_3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bc43390 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8347/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8347/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8347/console | This message was automatically generated. AbstractYarnScheduler.createReleaseCache may try to clean a null attempt Key: YARN-3830 URL: https://issues.apache.org/jira/browse/YARN-3830 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Attachments: YARN-3830_1.patch, YARN-3830_2.patch, YARN-3830_3.patch org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache() {code} protected void createReleaseCache() { // Cleanup the cache after nm expire interval. new Timer().schedule(new TimerTask() { @Override public void run() { for (SchedulerApplicationT app : applications.values()) { T attempt = app.getCurrentAppAttempt(); synchronized (attempt) { for (ContainerId containerId : attempt.getPendingRelease()) { RMAuditLogger.logFailure( {code} Here the attempt can be null since the attempt is created later. So null pointer exception will come {code} 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] threw an Exception. 
| YarnUncaughtExceptionHandler.java:68 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) {code} This will skip the other applications in this run. We can add a null check and continue with the other applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
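A sketch of the suggested null check, following the snippet from the description (the elided audit-log call is kept elided); this is illustrative and not necessarily identical to the attached YARN-3830 patches.
{code}
for (SchedulerApplication<T> app : applications.values()) {
  T attempt = app.getCurrentAppAttempt();
  if (attempt == null) {
    // The attempt is created later; skip this app so the timer run still
    // processes the remaining applications instead of dying with an NPE.
    continue;
  }
  synchronized (attempt) {
    for (ContainerId containerId : attempt.getPendingRelease()) {
      // ... existing RMAuditLogger.logFailure(...) call, unchanged ...
    }
  }
}
{code}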
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601755#comment-14601755 ] Ming Ma commented on YARN-221: -- Thanks. [~vinodkv] and others, any additional suggestions for the design? NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient
[ https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601701#comment-14601701 ] Mit Desai commented on YARN-1965: - Thanks for the patch [~kshukla]. I will review it shortly. Interrupted exception when closing YarnClient - Key: YARN-1965 URL: https://issues.apache.org/jira/browse/YARN-1965 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.3.0 Reporter: Oleg Zhurakousky Assignee: Kuhu Shukla Priority: Minor Labels: newbie Attachments: YARN-1965-v2.patch, YARN-1965.patch It's more of a nuisance than a bug, but nevertheless {code} 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting for clientExecutorto stop java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191) at org.apache.hadoop.ipc.Client.stop(Client.java:1235) at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621) at org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57) at org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) . . . {code} It happens sporadically when stopping YarnClient. Looking at the code in Client's 'unrefAndCleanup' it's not immediately obvious why and who throws the interrupt, but in any event it should not be logged as ERROR. Probably a WARN with no stack trace. Also, for consistency and correctness you may want to interrupt the current thread as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
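A hedged sketch of the handling suggested in the last paragraph of the description, namely logging at WARN without a stack trace and restoring the interrupt flag; the class, names and timeout below are illustrative and this is not the actual Client.unrefAndCleanup() code.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ExecutorShutdownSketch {
  private static final Log LOG = LogFactory.getLog(ExecutorShutdownSketch.class);

  static void stopQuietly(ExecutorService clientExecutor) {
    clientExecutor.shutdown();
    try {
      if (!clientExecutor.awaitTermination(1, TimeUnit.MINUTES)) {
        clientExecutor.shutdownNow();
      }
    } catch (InterruptedException e) {
      // WARN without the stack trace, as suggested above.
      LOG.warn("Interrupted while waiting for clientExecutor to stop");
      clientExecutor.shutdownNow();
      // Restore the interrupt so callers can observe it.
      Thread.currentThread().interrupt();
    }
  }
}
{code}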
[jira] [Created] (YARN-3853) Add docker container runtime support to LinuxContainerExecutor
Sidharta Seethana created YARN-3853: --- Summary: Add docker container runtime support to LinuxContainerExecutor Key: YARN-3853 URL: https://issues.apache.org/jira/browse/YARN-3853 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sidharta Seethana Assignee: Sidharta Seethana Create a new DockerContainerRuntime that implements support for docker containers via container-executor. LinuxContainerExecutor should default to current behavior when launching containers but switch to docker when requested. Until a first-class ‘container type’ mechanism/API is available on the client side, we could potentially implement this via environment variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
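To illustrate the environment-variable approach floated above, here is a minimal sketch; the variable name YARN_CONTAINER_RUNTIME_TYPE and the runtime names are assumptions for discussion, not a committed API.
{code}
import java.util.Map;

public class RuntimeSelectionSketch {
  // Illustrative env var; no such variable is defined by YARN at this point.
  static final String RUNTIME_TYPE_ENV = "YARN_CONTAINER_RUNTIME_TYPE";

  static String pickRuntime(Map<String, String> containerLaunchEnv) {
    String type = containerLaunchEnv.get(RUNTIME_TYPE_ENV);
    if ("docker".equalsIgnoreCase(type)) {
      return "DockerContainerRuntime";
    }
    // Default: keep current LinuxContainerExecutor behaviour.
    return "DefaultLinuxContainerRuntime";
  }
}
{code}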
[jira] [Updated] (YARN-3852) Add docker container support to container-executor
[ https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3852: Target Version/s: 2.8.0 Add docker container support to container-executor --- Key: YARN-3852 URL: https://issues.apache.org/jira/browse/YARN-3852 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Sidharta Seethana Assignee: Abin Shahab For security reasons, we need to ensure that access to the docker daemon and the ability to run docker containers is restricted to privileged users ( i.e users running applications should not have direct access to docker). In order to ensure the node manager can run docker commands, we need to add docker support to the container-executor binary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602006#comment-14602006 ] Naganarasimha G R commented on YARN-3644: - As long as the refactoring is taken care of in YARN-3847, I don't mind! I will try to review the patch as soon as possible. Node manager shuts down if unable to connect with RM Key: YARN-3644 URL: https://issues.apache.org/jira/browse/YARN-3644 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Srikanth Sundarrajan Assignee: Raju Bairishetti Attachments: YARN-3644.001.patch, YARN-3644.001.patch, YARN-3644.002.patch, YARN-3644.patch When NM is unable to connect to RM, NM shuts itself down. {code} } catch (ConnectException e) { //catch and throw the exception if tried MAX wait time to connect RM dispatcher.getEventHandler().handle( new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); throw new YarnRuntimeException(e); {code} In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring up the NMs. Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are being retried infinitely by all YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
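A hedged sketch of one way to keep the NM alive, based only on the catch block quoted in the description; the property name used to gate the behaviour is an illustrative assumption, and this is not the attached YARN-3644 patch.
{code}
} catch (ConnectException e) {
  // Illustrative gate; "yarn.nodemanager.resourcemanager.connect.retry-forever"
  // is an assumed key, not an existing configuration property.
  if (conf.getBoolean("yarn.nodemanager.resourcemanager.connect.retry-forever", false)) {
    LOG.warn("RM is unreachable after the maximum wait time; will keep retrying "
        + "instead of shutting down", e);
    // fall through: the registration/heartbeat loop retries on its next pass
  } else {
    dispatcher.getEventHandler().handle(
        new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
    throw new YarnRuntimeException(e);
  }
}
{code}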
[jira] [Updated] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3850: --- Attachment: YARN-3850.01.patch Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch *Container logs* can be lost if disk has become bad(become 90% full). When application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turns checks the eligible directories on call to {{LocalDirsHandlerService#getLogDirs}} which in case of disk full would return nothing. So none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory which contains container logs. This is because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
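A small sketch of the direction implied by the description: read container logs from every directory that might hold them (the good dirs plus the disks marked full, which getLogDirsForCleanup already includes), while continuing to write aggregated output only to good dirs. Method and variable names are illustrative and not taken from the attached patch.
{code}
import java.util.ArrayList;
import java.util.List;

public class LogDirSelectionSketch {
  static List<String> dirsToReadLogsFrom(List<String> goodLogDirs,
      List<String> cleanupLogDirs) {
    // Union of the write-eligible dirs and the full-but-readable dirs, so a
    // 90%-full disk no longer hides its container logs from aggregation.
    List<String> result = new ArrayList<String>(goodLogDirs);
    for (String dir : cleanupLogDirs) {
      if (!result.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }
}
{code}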
[jira] [Created] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup
Jian He created YARN-3855: - Summary: If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup Key: YARN-3855 URL: https://issues.apache.org/jira/browse/YARN-3855 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who and user cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup
[ https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3855: -- Description: If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who . User also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. (was: If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who and user cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default.) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup Key: YARN-3855 URL: https://issues.apache.org/jira/browse/YARN-3855 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who . User also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup
[ https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3855: -- Description: If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who . User also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. This is inconvenient from user's perspective. (was: If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who . User also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default.) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup Key: YARN-3855 URL: https://issues.apache.org/jira/browse/YARN-3855 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If all ACLs (admin acl, queue-admin-acls etc.) are setup properly and http.authentication.type is 'simple' in secure mode , user cannot view the application web page in default setup because the incoming user is always considered as dr.who . User also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. This is inconvenient from user's perspective. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602112#comment-14602112 ] Varun Saxena commented on YARN-3850: Below also seems to be a problem. {{RecoveredContainerLaunch#locatePidFile}} Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch *Container logs* can be lost if disk has become bad(become 90% full). When application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turns checks the eligible directories on call to {{LocalDirsHandlerService#getLogDirs}} which in case of disk full would return nothing. So none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory which contains container logs. This is because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602118#comment-14602118 ] Varun Saxena commented on YARN-3850: Raise a separate JIRA for this or fix it as part of this one ? Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch *Container logs* can be lost if disk has become bad(become 90% full). When application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turns checks the eligible directories on call to {{LocalDirsHandlerService#getLogDirs}} which in case of disk full would return nothing. So none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory which contains container logs. This is because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3851) Add support for container runtimes in YARN
Sidharta Seethana created YARN-3851: --- Summary: Add support for container runtimes in YARN Key: YARN-3851 URL: https://issues.apache.org/jira/browse/YARN-3851 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Sidharta Seethana Assignee: Sidharta Seethana We need the ability to support different container types within the same executor. Container runtimes are lower-level implementations for supporting specific container engines (e.g docker). These are meant to be independent of executors themselves - a given executor (e.g LinuxContainerExecutor) could potentially switch between different container runtimes depending on what a client/application is requesting. An executor continues to provide higher level functionality that could be specific to an operating system - for example, LinuxContainerExecutor continues to handle cgroups, users, diagnostic events etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
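An illustrative sketch (assumed names, not a committed API) of what such a runtime abstraction could look like: LinuxContainerExecutor would delegate the engine-specific steps to one of several implementations (a default process runtime, a docker runtime, ...) while keeping cgroups, users and diagnostics handling in the executor itself.
{code}
public interface ContainerRuntime {
  // Simple checked exception for the sketch; the real API may differ.
  class ContainerExecutionException extends Exception {
    public ContainerExecutionException(String msg) { super(msg); }
  }

  // Engine-specific lifecycle steps the executor delegates per launch request.
  void prepareContainer(String containerId) throws ContainerExecutionException;
  void launchContainer(String containerId) throws ContainerExecutionException;
  void signalContainer(String containerId, int signal) throws ContainerExecutionException;
  void reapContainer(String containerId) throws ContainerExecutionException;
}
{code}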
[jira] [Updated] (YARN-3851) Add support for container runtimes in YARN
[ https://issues.apache.org/jira/browse/YARN-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3851: Target Version/s: 2.8.0 Add support for container runtimes in YARN --- Key: YARN-3851 URL: https://issues.apache.org/jira/browse/YARN-3851 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Sidharta Seethana Assignee: Sidharta Seethana We need the ability to support different container types within the same executor. Container runtimes are lower-level implementations for supporting specific container engines (e.g. Docker). These are meant to be independent of the executors themselves - a given executor (e.g. LinuxContainerExecutor) could potentially switch between different container runtimes depending on what a client/application is requesting. An executor continues to provide higher-level functionality that could be specific to an operating system - for example, LinuxContainerExecutor continues to handle cgroups, users, diagnostic events, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602004#comment-14602004 ] Jason Lowe commented on YARN-3850: -- After thinking about this, I was wondering whether ShuffleHandler has a similar issue, since it too is looking for places from which to read files. It looks like it might not be affected in the same way, since it doesn't use LocalDirsHandlerService and just uses the underlying LocalDirAllocator. I don't think the latter will auto-update the list of bad/good directories, since it doesn't appear to update unless something tries to write through it or the conf is updated. I think it could be problematic in that the ShuffleHandler will likely continue to search disks that later go bad, or fail to search disks that were bad/full on startup and later became good. If we start persisting bad/full disks across NM restart, then it seems likely a map task could deposit shuffle data on a disk that the ShuffleHandler, with its stale view of the disks at startup, will fail to search. What do you think? This should be addressed in a separate JIRA if it is a problem, but I'm trying to think of other places in the NM where we would have a similar bug, searching only the good dirs for reading rather than also checking the full disks. Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker *Container logs* can be lost if a disk has become bad (i.e. more than 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk returns nothing, so none of the container logs are aggregated and uploaded. On application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither the aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
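A small sketch of the staleness concern raised above (hypothetical names, not the ShuffleHandler code): a component that snapshots the good-dir list once at startup and never re-queries it keeps searching dirs that have since gone bad and never searches dirs that have since recovered.
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a reader that captures the dir list once at startup
// drifts out of sync with the live good/bad dir state.
public class StaleDirSnapshot {

  static class LiveDirTracker {
    private final List<String> goodDirs = new ArrayList<>();
    List<String> getGoodDirs() { return new ArrayList<>(goodDirs); }
    void markGood(String d)    { if (!goodDirs.contains(d)) goodDirs.add(d); }
    void markBad(String d)     { goodDirs.remove(d); }
  }

  public static void main(String[] args) {
    LiveDirTracker tracker = new LiveDirTracker();
    tracker.markGood("/data1/shuffle");

    // Startup snapshot, as seen by a reader that never re-queries the tracker.
    List<String> startupSnapshot = tracker.getGoodDirs();

    // Later: /data1 fills up, /data2 becomes usable again.
    tracker.markBad("/data1/shuffle");
    tracker.markGood("/data2/shuffle");

    System.out.println("stale view : " + startupSnapshot);        // [/data1/shuffle]
    System.out.println("live view  : " + tracker.getGoodDirs());  // [/data2/shuffle]
  }
}
{code}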
[jira] [Commented] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup
[ https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602040#comment-14602040 ] Jian He commented on YARN-3855: --- Today, RMAuthenticationFilterInitializer is always added in non-secure mode. The proposal is to also always add RMAuthenticationFilterInitializer in secure mode, so that if http.authentication.type is 'simple', the user can pass user.name to indicate the incoming user name. If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup Key: YARN-3855 URL: https://issues.apache.org/jira/browse/YARN-3855 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If all ACLs (admin ACL, queue-admin ACLs, etc.) are set up properly and http.authentication.type is 'simple' in secure mode, the user cannot view the application web page in the default setup, because the incoming user is always considered to be dr.who and the user cannot pass user.name to indicate the incoming user name, since AuthenticationFilterInitializer is not enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
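To make the proposal concrete, here is a rough sketch of the kind of wiring being discussed (the fully-qualified initializer class name below is an assumption for illustration; the actual registration point is decided by the patch): once the RM authentication filter is active with 'simple' auth, a request can carry {{user.name}} instead of defaulting to dr.who.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only: shows the kind of wiring the comment proposes.
public class SimpleAuthFilterSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // hadoop.http.filter.initializers is the standard hook for adding servlet
    // filters to Hadoop web UIs; the class name used here is an assumption.
    conf.set("hadoop.http.filter.initializers",
        "org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilterInitializer");

    // With the filter active, a request such as
    //   http://<rm-host>:8088/cluster/app/<app-id>?user.name=alice
    // is evaluated against the ACLs as user "alice" rather than "dr.who".
    System.out.println(conf.get("hadoop.http.filter.initializers"));
  }
}
{code}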
[jira] [Commented] (YARN-3705) forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state
[ https://issues.apache.org/jira/browse/YARN-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602101#comment-14602101 ] Xuan Gong commented on YARN-3705: - [~iwasakims] Thanks for working on this. Here is one issue with this patch: if we call resetLeaderElection inside rmadmin.transitionToStandby(), it will cause an infinite loop. Basically, resetLeaderElection -> terminate and recreate the ZK client -> rejoin the leader elector -> transitionToStandby -> resetLeaderElection. Could you check this, please? forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state Key: YARN-3705 URL: https://issues.apache.org/jira/browse/YARN-3705 Project: Hadoop YARN Issue Type: Sub-task Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: YARN-3705.001.patch, YARN-3705.002.patch, YARN-3705.003.patch, YARN-3705.004.patch, YARN-3705.005.patch Executing {{rmadmin -transitionToStandby --forcemanual}} in automatic-failover.enabled mode makes the ResourceManager standby while keeping the state of the ActiveStandbyElector. It should make the elector quit and rejoin in order to enable other candidates to be promoted; otherwise, forcemanual transition should not be allowed in automatic-failover mode, in order to avoid confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
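A stripped-down sketch of the cycle described above (hypothetical method bodies; the real code lives in the RM admin service and the ZK-based elector): transitionToStandby resets the election, rejoining the election drives the RM back to standby, and that standby transition resets the election again.
{code}
// Illustrative only: models the mutual recursion described in the comment.
public class ElectionLoopSketch {

  static int depth = 0; // guard so this demo terminates instead of looping forever

  static void transitionToStandby() {
    System.out.println("transitionToStandby");
    resetLeaderElection();             // proposed change: reset elector state here
  }

  static void resetLeaderElection() {
    System.out.println("  resetLeaderElection: recreate ZK client, rejoin election");
    rejoinElection();
  }

  static void rejoinElection() {
    // On rejoining, this RM is (still) not the leader, so the elector
    // tells it to become standby again -> back to transitionToStandby.
    if (++depth < 3) {                 // the real code has no such guard: infinite loop
      transitionToStandby();
    } else {
      System.out.println("  (demo stopped after " + depth + " iterations)");
    }
  }

  public static void main(String[] args) {
    transitionToStandby();
  }
}
{code}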
[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602115#comment-14602115 ] Hadoop QA commented on YARN-3850: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 27s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 58s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 55s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 42s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 17s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 45m 6s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741960/YARN-3850.01.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / aa5b15b | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8351/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8351/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8351/console | This message was automatically generated. Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch *Container logs* can be lost if a disk has become bad (i.e. more than 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk returns nothing, so none of the container logs are aggregated and uploaded. On application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither the aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3852) Add docker container support to container-executor
Sidharta Seethana created YARN-3852: --- Summary: Add docker container support to container-executor Key: YARN-3852 URL: https://issues.apache.org/jira/browse/YARN-3852 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sidharta Seethana Assignee: Abin Shahab For security reasons, we need to ensure that access to the docker daemon and the ability to run docker containers is restricted to privileged users (i.e. users running applications should not have direct access to docker). In order to ensure the node manager can run docker commands, we need to add docker support to the container-executor binary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3854) Add localization support for docker images
Sidharta Seethana created YARN-3854: --- Summary: Add localization support for docker images Key: YARN-3854 URL: https://issues.apache.org/jira/browse/YARN-3854 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sidharta Seethana Assignee: Sidharta Seethana We need the ability to localize images from HDFS and load them for use when launching docker containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3850: --- Attachment: YARN-3850.01.patch Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch *Container logs* can be lost if a disk has become bad (i.e. more than 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk returns nothing, so none of the container logs are aggregated and uploaded. On application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither the aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3850: --- Attachment: (was: YARN-3850.01.patch) Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker *Container logs* can be lost if a disk has become bad (i.e. more than 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk returns nothing, so none of the container logs are aggregated and uploaded. On application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither the aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602045#comment-14602045 ] Varun Saxena commented on YARN-3850: Yes, this also looks like a problem. We should not use LocalDirAllocator for ShuffleHandler. I will look for other areas where a similar problem can happen and will update if I find something. Container logs can be lost if disk is full -- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch *Container logs* can be lost if a disk has become bad (i.e. more than 90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk returns nothing, so none of the container logs are aggregated and uploaded. On application finish we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the application directory containing the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither the aggregated logs for the app nor the individual container logs for the app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)