[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068248#comment-14068248 ] Wangda Tan commented on YARN-796: - Allen, I think what we were just discussing is how to support the hard-partition use case in YARN, isn't it? I'm surprised to get a -1 here; nobody has ever said dynamic labeling from the NM will not be supported. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture, etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we also need to support admin operations for adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2247: Attachment: apache-yarn-2247.3.patch {quote} bq. The current implementation uses the standard http authentication for hadoop. Users can set it to simple if they choose. I was trying to make the point that when Kerberos authentication is not used, simple authentication is not implicitly set, is it? In this case, without the authentication filter, we cannot identify the user via the HTTP interface, so we cannot behave correctly for operations that require knowledge of the user, such as submitting/killing an application. Taking a step back, let's look at the analogous RPC interfaces. By default, the authentication is SIMPLE, and on the server side we can still identify who the user is, so features such as ACLs still work in the SIMPLE case. {quote} Got it. I've added support for simple auth in the default case. I also spoke with [~vinodkv] offline and we felt that in secure mode the default static user should not be allowed to submit jobs. I made that change as well. {quote} bq. For now I'd like to use the same configs as the standard hadoop http auth. I'm open to changing them if we feel strongly about it in the future. It's okay to keep the configuration the same. Just thinking out loud: if so, you may not want to add RM_WEBAPP_USE_YARN_AUTH_FILTER at all, nor load YarnAuthenticationFilterInitializer programmatically. The rationales behind them are similar. Previously, I tried to add TimelineAuthenticationFilterInitializer programmatically because the same HTTP auth config applies to different daemons, and it's annoying that on a single-node cluster, configuring something only for the timeline server affects the others. Afterwards, I made the timeline server use a set of configs with the timeline-service prefix. This is what we did for the RPC interface configurations. {quote} I see your point, but I don't think forcing users to replicate existing configs makes sense at this point. The RM web interfaces are already controlled by the common HTTP auth configs and I'd like to preserve that behaviour. {quote} bq. I didn't understand - can you explain further? Let's take RMWebServices#getApp as an example. Previously we didn't have (or at least didn't know about) the auth filter, so we couldn't detect the user. Therefore, we didn't check the ACLs and simply got the application from RMContext and returned it. Now we have the auth filter and can identify the user. Hence, it's possible for us to fix this API to only return the application information to a user that has access. This is also another reason why I suggest always having the authentication filter on, whether it is simple or kerberos. {quote} Agreed with your tickets. {quote} bq. Am I looking at the wrong file? This is the right file, but I'm afraid it is not the correct logic. AuthenticationFilter accepts a null secret file. However, if we use AuthenticationFilterInitializer to construct AuthenticationFilter, the null case is denied. I previously opened a ticket for this issue (HADOOP-10600). {quote} Thanks for pointing that out. Fixed.
Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
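To make the YARN-2247 filter discussion concrete: a minimal sketch, using only the stock Hadoop HTTP-auth configuration keys, of how an authentication filter can stay on even in the non-Kerberos case. This is illustrative only - it is not what apache-yarn-2247.3.patch does, and the class and keys below are the common Hadoop ones, not RM-specific additions.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.AuthenticationFilterInitializer;

public class RmHttpAuthSketch {
  public static Configuration simpleAuthConf() {
    Configuration conf = new Configuration();
    // Register the common auth filter so every HTTP request carries an identity,
    // even when the auth type is "simple" rather than "kerberos".
    conf.set("hadoop.http.filter.initializers",
        AuthenticationFilterInitializer.class.getName());
    conf.set("hadoop.http.authentication.type", "simple");
    // Disallow anonymous access so a static default user cannot submit/kill apps.
    conf.set("hadoop.http.authentication.simple.anonymous.allowed", "false");
    return conf;
  }
}
{code}
With a filter always installed, web-service methods such as RMWebServices#getApp can resolve the caller and apply the same ACL checks as the RPC path.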
[jira] [Commented] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068258#comment-14068258 ] Tsuyoshi OZAWA commented on YARN-2325: -- Hi [~zxu], could you explain the case where nodeUpdate is called after removeNode? IIUC, the RMNodeImpl state machine assures that nodeUpdate is not called after removeNode without the node being added again. Concretely, NodeRemovedSchedulerEvent arises only in ReconnectNodeTransition/DeactivateNodeTransition. * ReconnectNodeTransition assures that a new node with the same node id is added after the transition. * DeactivateNodeTransition assures that the node will be DECOMMISSIONED, and the node won't be updated. If the transition you mentioned occurs, it might be a bug in the state-machine transitions. Please correct me if I'm wrong. need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2325.000.patch We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068289#comment-14068289 ] Hadoop QA commented on YARN-2247: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656829/apache-yarn-2247.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4380//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4380//console This message is automatically generated. Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068319#comment-14068319 ] zhihai xu commented on YARN-2325: - Hi Tsuyoshi OZAWA, thanks for your quick response to my patch. I agree with your points above. If this transition occurs, it might be a bug in the code. My patch just makes sure we return early to avoid a NullPointerException in case of some unexpected code error that causes the node to be removed. I also found that the current removeNode function does the same thing: it checks for null and returns early: if (node == null) { return; } If you think my patch is not needed for NPE prevention, I am OK to close this JIRA. need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2325.000.patch We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message. -- This message was sent by Atlassian JIRA (v6.2#6252)
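For concreteness, roughly what such a defensive check in FairScheduler#nodeUpdate would look like. This is a sketch in the spirit of YARN-2325.000.patch, not the patch itself; the log message is illustrative.
{code}
// Hypothetical guard at the top of FairScheduler#nodeUpdate (sketch only):
private synchronized void nodeUpdate(RMNode nm) {
  FSSchedulerNode node = getFSSchedulerNode(nm.getNodeID());
  if (node == null) {
    // The node was already removed (e.g. by a racing NODE_REMOVED event),
    // so log and return early instead of hitting a NullPointerException below.
    LOG.error("Node not found while processing heartbeat of " + nm.getNodeID());
    return;
  }
  // ... existing heartbeat/container handling continues here ...
}
{code}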
[jira] [Updated] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2325: Priority: Minor (was: Major) need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2325.000.patch need check whether node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, the getFSSchedulerNode will be null. If the node is null, we should return with error message. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068332#comment-14068332 ] Naganarasimha G R commented on YARN-2301: - Hi [~jianhe] and [~zjshen], I have fixed the first 3 items, but while working on the 4th one from [~zjshen]'s comments bq. 4) May have an option to run as yarn container -list appId I had to pick between a few alternatives, so I wanted your opinion before I go forward. # Issue 1: As I understand it, this option should be in addition to ??yarn container -list <Application Attempt ID>??. As it's a CLI we can't have polymorphic parameters, so I thought of providing a command like ??yarn container -list-forAppId <Application ID>??. Is this option fine? Please also note that the GetContainersRequestProto API will be changed. # Issue 2: As [~zjshen] pointed out, we also need to take care of containers coming from the Timeline server flow. For this I would suggest a few options: #* Option 1: For the given AppID, when the application is still running (RM flow), only show containers for the currently running application attempt, irrespective of earlier failed attempts. When the application is finished (Timeline server flow), only show containers for the last attempt of the application (whether it failed or succeeded). #* Option 2: Introduce additional parameters like ??-all?? and ??-last??. Let ??-last?? behave like Option 1 and be the default, and for ??-all??, in both flows (RM and Timeline server), display containers for each attempt. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running the yarn container -list <Application Attempt ID> command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format. 3) finish-time is 0 if the container is not yet finished. Maybe show N/A instead. 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId as well. As the attempt id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033_ALL.3.patch Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.3.patch Rebase against the latest trunk Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2262: --- Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068363#comment-14068363 ] Junping Du commented on YARN-2013: -- Agree that the test failure is not related, as it randomly shows up in many other tests. +1. Will commit it shortly. The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, an ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics are not of much use for users to analyze the reason for a container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2323) FairShareComparator creates too many Resource objects
[ https://issues.apache.org/jira/browse/YARN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068389#comment-14068389 ] Hudson commented on YARN-2323: -- FAILURE: Integrated in Hadoop-Yarn-trunk #619 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/619/]) YARN-2323. FairShareComparator creates too many Resource objects (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612187) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java FairShareComparator creates too many Resource objects - Key: YARN-2323 URL: https://issues.apache.org/jira/browse/YARN-2323 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.6.0 Attachments: YARN-2323-2.patch, YARN-2323.patch Each call of {{FairShareComparator}} creates a new Resource object, {{one}}: {code} Resource one = Resources.createResource(1); {code} At the scale of 1000 nodes and 1000 apps, the comparator will be called more than 10 million times per second, thus creating more than 10 million {{one}} objects, which is unnecessary. Since the object {{one}} is read-only and is never referenced outside of the comparator, we could make it static. -- This message was sent by Atlassian JIRA (v6.2#6252)
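In other words, the committed change amounts to hoisting that constant out of the hot path. A rough sketch of the pattern (class and field names here are illustrative, not the literal FairSharePolicy diff):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch of the idea behind the fix: the constant consulted on every
// comparison becomes a single static, read-only instance instead of being
// re-created per call.
class FairShareComparatorSketch {
  private static final Resource ONE = Resources.createResource(1);

  static Resource one() {
    // Previously: return Resources.createResource(1);  // new object per call
    return ONE;                                          // shared constant
  }
}
{code}
This is safe because the object is never mutated and never escapes the comparator, so at 10+ million comparisons per second the allocation and GC pressure simply disappear.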
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068397#comment-14068397 ] Hadoop QA commented on YARN-2033: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656841/YARN-2033_ALL.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 20 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.timeline.webapp.TestTimelineWebServices org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4381//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4381//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068411#comment-14068411 ] Zhijie Shen commented on YARN-1954: --- [~ozawa], I'm generally fine with the patch. Just some minor comments. 1. Throw IllegalArgumentException instead? {code} +if (checkEveryMillis <= 0) { + checkEveryMillis = 1; +} {code} 2. What if checkEveryMillis > 6? Maybe we can simply hard-code a fixed number of rounds after which to output a warning log. And don't output a warning log in each round; output an info log at regular intervals instead. What do you think? {code} +final int loggingCounterInitValue = 6 / checkEveryMillis; +int loggingCounter = loggingCounterInitValue; {code} 3. There are unnecessary changes in AMRMClientAsyncImpl. 4. In TestAMRMClient#testWaitFor, can you justify countDownChecker#counter == 3 after waitFor? 5. Not necessary to be in a synchronized block if you don't use wait here. {code} +synchronized (callbackHandler.notifier) { + asyncClient.registerApplicationMaster(localhost, 1234, null); + asyncClient.waitFor(checker); +} {code} Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied check point, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
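To show the usage pattern being reviewed here: a hedged sketch of how an AM's main thread could block on the proposed waitFor instead of a hand-written sleep loop. It assumes the Guava Supplier-based check discussed in the patch; the exact signature is whatever YARN-1954.4.patch finally defines, so the commented calls are illustrative.
{code}
import com.google.common.base.Supplier;
import java.util.concurrent.atomic.AtomicBoolean;

public class WaitForSketch {
  // 'done' would be flipped by the AMRMClientAsync callback handler once all
  // containers have finished and the AM has unregistered.
  static void awaitFinish(final AtomicBoolean done /*, AMRMClientAsync<?> client */)
      throws InterruptedException {
    Supplier<Boolean> check = new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return done.get();
      }
    };
    // Instead of the dummy loop in the non-daemon main thread:
    // client.waitFor(check);          // poll with the default interval
    // client.waitFor(check, 1000);    // or poll every checkEveryMillis = 1000 ms
  }
}
{code}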
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068431#comment-14068431 ] Zhijie Shen commented on YARN-2304: --- [~ozawa] and [~kj-ki], the test failures seem to go beyond the TestRMWebServices* set: https://builds.apache.org/job/PreCommit-YARN-Build/4381//testReport/ We didn't change the Jersey dependency recently, but the test cases didn't fail before. Is it likely that some other test which uses JerseyTest has hung, such that the following tests cannot bind to the same port? TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of a bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: YARN-2313.4.patch Thanks for the review, [~sandyr]. Updated the patch: * Moved the new configuration definition to FairSchedulerConfiguration. * Excluded the warning triggered by updateInterval. * Replaced "use" with "using" in the warning message. * Removed trailing spaces. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, YARN-2313.4.patch, rm-stack-trace.txt Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) if there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
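For context, a rough sketch of the UpdateThread shape implied by the patch notes above (configurable interval, warning when update() overruns it). Field names and the warning text are placeholders, not the literal contents of YARN-2313.4.patch.
{code}
// Sketch only; not the actual FairScheduler code.
private class UpdateThread extends Thread {
  @Override
  public void run() {
    while (true) {
      try {
        Thread.sleep(updateInterval);   // interval read from FairSchedulerConfiguration
        long start = System.currentTimeMillis();
        update();
        long duration = System.currentTimeMillis() - start;
        if (duration > updateInterval) {
          // update() took longer than its interval; warn instead of silently
          // starving AllocationFileReloader and the scheduler event dispatcher.
          LOG.warn("Update call took " + duration
              + " ms, longer than the configured update interval of "
              + updateInterval + " ms");
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
{code}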
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068444#comment-14068444 ] Zhijie Shen commented on YARN-2319: --- [~gujilangzi], it seems that setupKDC does not need to be invoked on every parameterized round, does it? It can be done once before the class. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch MiniKdc only invokes the start method, and never stop, in TestRMWebServicesDelegationTokens.java: {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
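A minimal sketch of the suggested lifecycle: start MiniKdc once before the class and stop it in a matching teardown so it is not leaked across parameterized rounds. The JUnit hook names and work directory here are assumptions for illustration; the actual test may structure its setup differently.
{code}
import java.io.File;
import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class MiniKdcLifecycleSketch {
  private static MiniKdc testMiniKDC;

  @BeforeClass
  public static void setupKDC() throws Exception {
    // One KDC for the whole test class instead of one per parameterized round.
    testMiniKDC = new MiniKdc(MiniKdc.createConf(), new File("target", "test-kdc"));
    testMiniKDC.start();
  }

  @AfterClass
  public static void shutdownKDC() {
    if (testMiniKDC != null) {
      testMiniKDC.stop();   // the missing counterpart to testMiniKDC.start()
    }
  }
}
{code}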
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068449#comment-14068449 ] Devaraj K commented on YARN-1342: - Thanks for updating the patch. Good work [~jlowe]. Here are some comments on the patch. 1. NodeManager.java * NMContainerTokenSecretManager gets the nmStore through its constructor, and recover() also gets the state as an argument, which comes from nmStore. I think we can get the state from nmStore inside recover() instead of passing it as an argument. {code:xml}+ containerTokenSecretManager.recover(nmStore.loadContainerTokenState());{code} {code:xml}+new NMContainerTokenSecretManager(conf, nmStore);{code} The same applies to NMTokenSecretManagerInNM as well. 2. NMLeveldbStateStoreService.java * Here e.getMessage() may not be required as the message since we are wrapping the same exception. If we had some custom message we could pass it; otherwise we can simply use new IOException(e). {code:xml}throw new IOException(e.getMessage(), e);{code} * Can we move the CONTAINER_TOKENS_KEY_PREFIX.length() call outside of the while loop? {code:xml} String key = fullKey.substring(CONTAINER_TOKENS_KEY_PREFIX.length()); {code} * Can we make the string *container_* a constant? {code:xml} } else if (key.startsWith("container_")) { {code} 3. What do you think of using the names RecoveredContainerTokensState and loadContainerTokensState instead of RecoveredContainerTokenState and loadContainerTokenState, since these handle more than one ContainerToken? Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
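To illustrate comment 1: a sketch of the suggested shape, where recover() reads the state from the store injected via the constructor instead of taking it as an argument. The class, method, and state-type names follow the patch/review discussion and are assumptions, not the final committed API.
{code}
import java.io.IOException;

// Sketch of the reviewer's suggestion for NMContainerTokenSecretManager.
public class ContainerTokenRecoverySketch {
  private final NMStateStoreService stateStore;   // assumed store interface

  public ContainerTokenRecoverySketch(NMStateStoreService stateStore) {
    this.stateStore = stateStore;
  }

  // Instead of recover(RecoveredContainerTokensState state) being fed by the
  // caller (NodeManager), pull the state from the already-injected store.
  public void recover() throws IOException {
    NMStateStoreService.RecoveredContainerTokensState state =
        stateStore.loadContainerTokensState();
    // ... replay master keys and container-token expiration times from 'state' ...
  }
}
{code}
The same shape would then apply to NMTokenSecretManagerInNM, keeping NodeManager.java free of store-loading details.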
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068457#comment-14068457 ] Tsuyoshi OZAWA commented on YARN-2304: -- Hi Zhijie, thank you for the comment. {quote} Is it likely that some other test which uses JerseyTest has hung, such that the following tests cannot bind to the same port? {quote} Yes, I suspect another test that uses JerseyTest has hung and failed to clean up. I checked recent commits since the end of June to confirm it, but cannot find the cause for now. I wrote my thoughts on YARN-2304: {quote} I think there are two kinds of possibilities: 1. recently added Web-Services tests didn't do proper cleanup, as you mentioned; 2. an infra change for the Jenkins CI had an influence and the problem appeared. As I mentioned on YARN-2304, we don't support parallel test runs now because JerseyTest 1.9 doesn't support them. If the tests run in parallel on a single machine, they fail. {quote} I don't know how the Jenkins infra changed, so this is just one possibility and only my speculation. TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of a bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2326) WritingYarnApplications.html includes deprecated information
Tsuyoshi OZAWA created YARN-2326: Summary: WritingYarnApplications.html includes deprecated information Key: YARN-2326 URL: https://issues.apache.org/jira/browse/YARN-2326 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Tsuyoshi OZAWA For example, YarnConfiguration.YARN_SECURITY_INFO was removed in MAPREDUCE-3013, but the document still mentions it: {code} appsManagerServerConf.setClass( YarnConfiguration.YARN_SECURITY_INFO, ClientRMSecurityInfo.class, SecurityInfo.class); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068474#comment-14068474 ] Karthik Kambatla commented on YARN-2273: Yep. Thanks for pointing it out, Tsuyoshi. That makes sense. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068478#comment-14068478 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656855/YARN-2313.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4382//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4382//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, YARN-2313.4.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068500#comment-14068500 ] Karthik Kambatla commented on YARN-2273: +1. Checking this in. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2323) FairShareComparator creates too many Resource objects
[ https://issues.apache.org/jira/browse/YARN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068506#comment-14068506 ] Hudson commented on YARN-2323: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1838 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1838/]) YARN-2323. FairShareComparator creates too many Resource objects (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612187) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java FairShareComparator creates too many Resource objects - Key: YARN-2323 URL: https://issues.apache.org/jira/browse/YARN-2323 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.6.0 Attachments: YARN-2323-2.patch, YARN-2323.patch Each call of {{FairShareComparator}} creates a new Resource object, {{one}}: {code} Resource one = Resources.createResource(1); {code} At the scale of 1000 nodes and 1000 apps, the comparator will be called more than 10 million times per second, thus creating more than 10 million {{one}} objects, which is unnecessary. Since the object {{one}} is read-only and is never referenced outside of the comparator, we could make it static. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068508#comment-14068508 ] Karthik Kambatla commented on YARN-2273: Actually, let me retract that +1 temporarily. Can we add a test case here? We can move while(true) into the run method and rename continuousScheduling to continuousSchedulingAttempt. The test from replayException can be used for the test, if we move the fail() to catch-block. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
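A sketch of that suggestion for concreteness: hoist the while(true) into run() and keep one pass per call of continuousSchedulingAttempt(), so a test can drive a single attempt directly and an unexpected exception (like the NPE above) cannot kill the thread. Names follow the comment; the committed patch may differ.
{code}
// Sketch inside FairScheduler (not the literal patch):
private class ContinuousSchedulingThread extends Thread {
  @Override
  public void run() {
    while (true) {
      try {
        continuousSchedulingAttempt();   // one sort-and-assign pass over the nodes
        Thread.sleep(continuousSchedulingSleepMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      } catch (Exception e) {
        // A node removed mid-sort (the NPE in the description) should not
        // cripple the RM; log it and retry on the next iteration.
        LOG.error("Exception in continuous scheduling attempt", e);
      }
    }
  }
}
{code}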
[jira] [Commented] (YARN-2323) FairShareComparator creates too many Resource objects
[ https://issues.apache.org/jira/browse/YARN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068537#comment-14068537 ] Hudson commented on YARN-2323: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1811 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1811/]) YARN-2323. FairShareComparator creates too many Resource objects (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612187) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java FairShareComparator creates too many Resource objects - Key: YARN-2323 URL: https://issues.apache.org/jira/browse/YARN-2323 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.6.0 Attachments: YARN-2323-2.patch, YARN-2323.patch Each call of {{FairShareComparator}} creates a new Resource object, {{one}}: {code} Resource one = Resources.createResource(1); {code} At the scale of 1000 nodes and 1000 apps, the comparator will be called more than 10 million times per second, thus creating more than 10 million {{one}} objects, which is unnecessary. Since the object {{one}} is read-only and is never referenced outside of the comparator, we could make it static. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068545#comment-14068545 ] Jason Lowe commented on YARN-1198: -- We need to think of headroom as a report of resources that can be allocated if requested. If the AM cannot currently allocate any containers then the headroom should be reported as zero. I think guaranteed headroom is a separate JIRA and not necessary to solve the deadlock issues surrounding the current headroom reporting. bq. Because in a dynamic cluster, the number can change rapidly, it is possible that the cluster is filled up by another application just one second after the AM got the available headroom. Sure, this can happen. However, on the next heartbeat the headroom will be reported as less than it was before, and the AM can take appropriate action. I don't see this as a major issue, at least in the short term. Telling an AM repeatedly that it can allocate resources that will never be allocated in practice is definitely wrong and needs to be fixed. bq. And also, this field cannot solve the deadlock problem either: a malicious application can ask for much more resource than this, or a careless developer may totally ignore this field. A malicious application cannot cause another application to deadlock as long as the YARN scheduler properly enforces user limits and properly reports the headroom to applications. It seems to me the worst case is an application hurting itself, but since the entire application can be custom user code there's not much YARN can do to prevent that. bq. The only valid solution in my head is putting such logic into the scheduler side, and enforcing resource usage by a preemption policy. The problem is that the scheduler does not, and IMHO should not, know the details of the particular application. For example, let's say an application's headroom goes to zero but it has outstanding allocation requests. Should the YARN scheduler automatically preempt something when this occurs? If so, which container does it preempt? These are questions an AM can answer optimally, including an answer of preempting nothing (e.g. a task is completing imminently), while I don't see how the YARN scheduler can make good decisions without either putting application-specific logic in the YARN scheduler or having the YARN scheduler defer to the AM to make the decision. Reporting the headroom to the AM enables the AM to make an application-optimal decision about what to do, if anything, when the resources available to the application change. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * A new node is added/removed from the cluster * A new container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. 
** To simplify the whole communication process it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when an admin user refreshes the queue, the headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068546#comment-14068546 ] Zhijie Shen commented on YARN-2033: --- The test failures are not related. See YARN-2304. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068587#comment-14068587 ] Hudson commented on YARN-2045: -- FAILURE: Integrated in Hadoop-trunk-Commit #5922 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5922/]) YARN-2045. Data persisted in NM should be versioned. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612285) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/NMDBSchemaVersion.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/records/impl/pb/NMDBSchemaVersionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045-v7.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068594#comment-14068594 ] Chen He commented on YARN-1198: --- {quote} With preemption, resource beyond the guaranteed resource will likely be preempted. It should be considered a temporary resource. {quote} One thing needs to be clarified about preemption: I think we can resolve YARN-2008 without introducing preemption, because if we allow preemption before defining priority, it wastes time and resources to let thousands of AMs repeatedly compete for those temporary resources. Priority is the most important factor in preemptive scheduling. I think, in this JIRA, we are talking about how to efficiently and relatively accurately compute headroom in the Capacity Scheduler. Preemption is another story. Here is how preemption is defined in scheduling: In computing, preemption is the act of temporarily interrupting a task being carried out by a computer system, without requiring its cooperation, and with the intention of resuming the task at a later time. Such a change is known as a context switch. It is normally carried out by a privileged task or part of the system known as a preemptive scheduler, which has the power to preempt, or interrupt, and later resume, other tasks in the system. Refer to Preemption on Wikipedia [http://en.wikipedia.org/wiki/Preemption_%28computing%29] Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * A new node is added/removed from the cluster * A new container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when an admin user refreshes the queue, the headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2262: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1530 Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068597#comment-14068597 ] Zhijie Shen commented on YARN-2262: --- [~nishan], do you have more details about the issue? Does this application already finish before RM restarting? Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2262: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2262: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068607#comment-14068607 ] Zhijie Shen commented on YARN-2304: --- I saw we've opened a number of Test*WebServices* tickets, but I believe the problem was the same. Shall we consolidate all of them? TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2304: - Summary: Test*WebServices* fails intermittently (was: TestRMWebServices* fails intermittently) Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-2316. -- Resolution: Duplicate TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From Jenkins test in YARN-2045 and YARN-1341, these tests get failed with address already get bind. The similar issue happens at RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068624#comment-14068624 ] Tsuyoshi OZAWA commented on YARN-2304: -- OK, let's deal with the problem on this JIRA. Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068628#comment-14068628 ] Tsuyoshi OZAWA commented on YARN-2316: -- Closing this as a duplicate of YARN-2304, because the causes of YARN-2304, YARN-2316, and MAPREDUCE-5973 look the same. Let's consolidate them and deal with them on YARN-2304. TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From Jenkins test in YARN-2045 and YARN-1341, these tests get failed with address already get bind. The similar issue happens at RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2322) Provide Cli to refresh Admin Acls for Timeline server
[ https://issues.apache.org/jira/browse/YARN-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068633#comment-14068633 ] Zhijie Shen commented on YARN-2322: --- In fact, we need an admin service for the timeline server. Provide Cli to refresh Admin Acls for Timeline server Key: YARN-2322 URL: https://issues.apache.org/jira/browse/YARN-2322 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Karam Singh Provide Cli to refresh Admin Acls for Timelineserver. Currently rmadmin -refreshAdminAcls provides a facility to refresh Admin Acls for the ResourceManager. But if we want to modify adminAcls for the Timelineserver, we need to restart it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2322) Provide Cli to refresh Admin Acls for Timeline server
[ https://issues.apache.org/jira/browse/YARN-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2322: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-1530 Provide Cli to refresh Admin Acls for Timeline server Key: YARN-2322 URL: https://issues.apache.org/jira/browse/YARN-2322 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Karam Singh Provide Cli to refresh Admin Acls for Timelineserver. Currently rmadmin -refreshAdminAcls provides a facility to refresh Admin Acls for the ResourceManager. But if we want to modify adminAcls for the Timelineserver, we need to restart it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2304) Test*WebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2304: - Description: TestNMWebService, TestRMWebService, and TestAMWebService fail intermittently with bind exceptions (address already in use). was:The test fails intermittently because of bind exception. Test*WebServices* fails intermittently -- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt TestNMWebService, TestRMWebService, and TestAMWebService fail intermittently with bind exceptions (address already in use). -- This message was sent by Atlassian JIRA (v6.2#6252)
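Editor's note: a minimal sketch (not the fix applied in these JIRAs) of the usual way such "address already in use" test failures are avoided: bind a throwaway socket to port 0 so the OS picks a free ephemeral port, then hand that port to the server under test. There is still a small race between releasing the probe socket and the server binding, which is why this is only a common mitigation rather than a guarantee.
{code}
import java.io.IOException;
import java.net.ServerSocket;

public final class FreePortExample {

  /** Asks the OS for a currently unused ephemeral port. */
  static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    int port = findFreePort();
    System.out.println("Test web server could bind to 127.0.0.1:" + port);
  }
}
{code}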
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068640#comment-14068640 ] Tsuyoshi OZAWA commented on YARN-2273: -- Sounds reasonable. [~ywskycn], could you update to address Karthik's comment? NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068684#comment-14068684 ] Wei Yan commented on YARN-2273: --- [~kasha], [~ozawa], will update a patch soon. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068704#comment-14068704 ] Alejandro Abdelnur commented on YARN-796: - Wangda, I had previously missed the new doc explaining label predicates. Thanks for pointing it out. How about first shooting for the following? * RM has list of valid labels. (hot reloadable) * NMs have list of labels. (hot reloadable) * NMs report labels at registration time and on heartbeats when they change * label-expressions support (AND) only * app able to specify a label-expression when making a resource request * queues to AND augment the label expression with the queue label-expression And later we can add (in a backwards compatible way) * add support for OR and NOT to label-expressions * add label ACLs * centralized per NM configuration, REST API for it, etc, etc Thoughts? Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
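Editor's note: a rough sketch of the AND-only semantics proposed in the comment above (purely illustrative, not YARN code, and the "&&" syntax is an assumption): a node satisfies a label expression such as "GPU && LINUX" exactly when the node's label set contains every label mentioned in the expression. OR and NOT, as noted, could be layered on later in a backwards-compatible way.
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class LabelExpressionSketch {

  /** True if the node carries every label named in the AND-only expression. */
  static boolean satisfies(Set<String> nodeLabels, String andExpression) {
    for (String label : andExpression.split("&&")) {
      if (!nodeLabels.contains(label.trim())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> node = new HashSet<>(Arrays.asList("GPU", "LINUX", "X86"));
    System.out.println(satisfies(node, "GPU && LINUX"));   // true
    System.out.println(satisfies(node, "GPU && WINDOWS")); // false
  }
}
{code}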
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068735#comment-14068735 ] Jian He commented on YARN-2295: --- The patch looks good overall; one comment: we can have newInstance cover the following setResource call as well. {code} LocalResource shellRsrc = LocalResource.newInstance(null, LocalResourceType.FILE, LocalResourceVisibility.APPLICATION, shellScriptPathLen, shellScriptPathTimestamp); try { shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(renamedScriptPath.toString()))); } {code} Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
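Editor's note: one possible shape for Jian He's suggestion, shown as a small helper. This helper is hypothetical (it is not part of the patch), but the newInstance overload that takes the resource URL up front and ConverterUtils.getYarnUrlFromURI are existing YARN APIs, so the two-step newInstance(null, ...) + setResource(...) pattern quoted above can collapse into one call along these lines.
{code}
import java.net.URI;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public final class LocalResourceHelper {

  /** Builds the shell-script LocalResource with its URL set in a single call. */
  static LocalResource scriptResource(URI scriptUri, long length, long timestamp) {
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromURI(scriptUri), // resource URL set up front
        LocalResourceType.FILE,
        LocalResourceVisibility.APPLICATION,
        length,
        timestamp);
  }
}
{code}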
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2273: -- Attachment: YARN-2273.patch NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068791#comment-14068791 ] Craig Welch commented on YARN-1198: --- [~wangda] I concur with [~jlowe] and [~airbots] that these headroom fixes (incl. [YARN-2008]) should happen. I don't think this is a redefinition of headroom; headroom remains the maximum resources an application can get - the application can't get resources that are unavailable because they are in use, which is what the change addresses. I think of this change as really only a fix for a missed case - it will in fact return the same value as it does today except under some specific cases of higher cluster utilization, in which case the value it returns will actually be better than its current behavior in terms of helping the AM work accurately and preventing some known deadlock conditions. This kind of behavior is a necessary consequence of allowing oversubscription of cluster resources vis-à-vis the maximum allocation, which is greater than the baseline (and which in aggregate can be 100%), and this oversubscription is a reasonable design choice to allow applications to burst above their guaranteed level when other queues are less utilized. As I mentioned on [YARN-2008], since the aggregate maximum can be 100% it's not possible to solve this solely with preemption - AMs will still be getting higher values than are available without this correction - so, retaining the max behavior for the reasons above, this kind of approach is going to be the way to go. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These are all potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068810#comment-14068810 ] Zhijie Shen commented on YARN-2301: --- bq. Issue1 : As i understand this option should be in addition to yarn container -list Application Attempt ID, As its CLI we can't have polymorphic parameters, How about having -list only, and then parsing whether the given id is an app id or an app attempt id? bq. Issue2 : As Zhijie Shen pointed out we need to take care for containers from Timeline server flow also. For this i would suggest few options Option 2 sounds more attractive to me if it doesn't introduce too much complexity. [~jianhe], I haven't been tracking the recent scheduler changes. Is it able to show the containers of a previous app attempt, or the finished containers of the current app attempt? Previously, a container was removed from the scheduler once it finished. If that is still the case, -all can only work for applications in the timeline server but not for a running app in the RM. Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability While running yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https ) before LOG-URL is missing 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to print as time format. 3) finish-time is 0 if container is not yet finished. May be N/A 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
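Editor's note: a sketch of the "-list only" idea from the comment above. This is an assumption about how the dispatch could work, not the eventual patch: YARN IDs are self-describing strings, so the CLI can tell an application ID from an application-attempt ID by its prefix rather than requiring separate options.
{code}
public final class ContainerListArg {

  enum Kind { APPLICATION, APP_ATTEMPT, UNKNOWN }

  /** Classifies a CLI argument by the standard YARN ID prefixes. */
  static Kind classify(String id) {
    if (id.startsWith("appattempt_")) {
      return Kind.APP_ATTEMPT;           // e.g. appattempt_1404858438119_4352_000001
    }
    if (id.startsWith("application_")) {
      return Kind.APPLICATION;           // e.g. application_1404858438119_4352
    }
    return Kind.UNKNOWN;
  }

  public static void main(String[] args) {
    System.out.println(classify("application_1404858438119_4352"));       // APPLICATION
    System.out.println(classify("appattempt_1404858438119_4352_000001")); // APP_ATTEMPT
  }
}
{code}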
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068834#comment-14068834 ] Karthik Kambatla commented on YARN-2273: Thanks Wei. A few comments on the latest patch, some not specific to changes in this patch. # Can continuousSchedulingAttempt be package-private? # We should log the following at ERROR level: {code} + } catch (Throwable ex) { + LOG.warn("Error while attempting scheduling for node " + node + ": " + ex.toString(), ex); + } {code} # When the scheduling thread is interrupted, shouldn't we actually stop the thread? What are the cases where we want to ignore an interruption? # Update the log message in the catch-block of InterruptedException to "Continuous scheduling thread interrupted." Maybe add "Exiting." if we do decide to shut the thread down. # In the test, do we need to call FS#reinitialize()? # In the test, should we catch all exceptions instead of just NPE? NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
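Editor's note: a minimal sketch of the loop structure implied by the review comments above. The class and method names are assumptions, not the FairScheduler source: a failure while scheduling on one node is logged at ERROR and the loop moves on, while an interruption is treated as a request to shut the thread down rather than something to ignore.
{code}
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ContinuousSchedulingLoop implements Runnable {
  private static final Logger LOG =
      LoggerFactory.getLogger(ContinuousSchedulingLoop.class);

  private final List<String> nodes = new ArrayList<>(); // placeholder for tracked nodes

  void attemptSchedulingOnNode(String node) {
    // placeholder for the real per-node scheduling work
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      for (String node : nodes) {
        try {
          attemptSchedulingOnNode(node);
        } catch (Throwable t) {
          // ERROR, as requested in the review: one bad node must not kill the thread.
          LOG.error("Error while attempting scheduling for node " + node, t);
        }
      }
      try {
        Thread.sleep(100); // pause between scheduling passes
      } catch (InterruptedException e) {
        LOG.error("Continuous scheduling thread interrupted. Exiting.", e);
        return; // stop the thread rather than ignoring the interruption
      }
    }
  }
}
{code}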
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068872#comment-14068872 ] Jian He commented on YARN-2211: --- - AMRMTokenSecretManagerState class is not needed. we may just use AMRMTokenSecretManagerStateData and rename AMRMTokenSecretManagerStateData to AMRMTokenSecretManagerState. we should probably remove other ApplicationState etc. in a different jira too. - Fix the naming, similarly AMRMTokenSecretManagerStateData#get/setCurrentTokenMasterKey {code} *|- currentTokenMasterKey *|- nextTokenMasterKey {code} - existsWithRetries(amrmTokenSecretManagerRoot, true) != null; change to use isUpdate flag RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
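Editor's note: an illustrative sketch of the state shape being discussed in this review (names and types are assumptions, not the patch): the store persists at most two AMRMToken master keys, the one currently in use and, during a roll-over window, the next one, so a restarted RM can keep validating AM tokens issued before the failover.
{code}
public final class AMRMTokenSecretManagerStateSketch {

  private final byte[] currentMasterKey; // key material currently used to sign AMRMTokens
  private final byte[] nextMasterKey;    // non-null only while a key roll-over is pending

  public AMRMTokenSecretManagerStateSketch(byte[] currentMasterKey, byte[] nextMasterKey) {
    this.currentMasterKey = currentMasterKey;
    this.nextMasterKey = nextMasterKey;
  }

  public byte[] getCurrentMasterKey() { return currentMasterKey; }
  public byte[] getNextMasterKey() { return nextMasterKey; }
}
{code}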
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068879#comment-14068879 ] Anubhav Dhoot commented on YARN-1372: - Yes. Working on this now. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
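Editor's note: a sketch of the protocol described in the YARN-1372 summary above, with assumed names rather than the actual NM code: the NM keeps every completed container it has reported until the RM confirms that the AM has pulled it, and on re-registration after an RM restart the whole pending set is resent.
{code}
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class CompletedContainerTracker {

  // container IDs reported to the RM but not yet acknowledged as pulled by the AM
  private final Set<String> pendingAck = ConcurrentHashMap.newKeySet();

  /** Called when a container finishes; it stays tracked until acknowledged. */
  void containerCompleted(String containerId) {
    pendingAck.add(containerId);
  }

  /** Sent in heartbeats and, after an RM restart, in the re-registration request. */
  Collection<String> completedContainersToReport() {
    return pendingAck;
  }

  /** Called when the RM tells the NM that the AM has pulled these containers. */
  void acknowledged(Collection<String> containerIds) {
    pendingAck.removeAll(containerIds);
  }
}
{code}
This matches the note in the description that some completions may be reported more than once: a report is only dropped once the RM's acknowledgement arrives, so an RM crash between the AM pull and the acknowledgement simply causes a harmless duplicate.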
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2273: -- Attachment: YARN-2273.patch Thanks, [~kasha], update a new patch to address the comments. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068910#comment-14068910 ] Hadoop QA commented on YARN-2273: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656888/YARN-2273.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4383//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4383//console This message is automatically generated. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068947#comment-14068947 ] Craig Welch commented on YARN-1198: --- Also - this, combined with preemption, will be desirable behavior - as preemption rebalances, this logic will properly (accurately) raise the headroom value for an application - since the AM understands the particulars of its own task ordering, it needs to know what resources it actually has to work with as preemption frees them, in order to make optimal use of them. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These are all potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2327) YARN should warn about nodes with poor clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-2327: --- Summary: YARN should warn about nodes with poor clock synchronization (was: A hadoop cluster needs clock synchronization) YARN should warn about nodes with poor clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen As a distributed system, a hadoop cluster wants the clock on all the participating hosts synchronized. Otherwise, some problems might happen. For example, in YARN-2251, due to the clock on the host for the task container falls behind that on the host of the AM container, the computed elapsed time (the diff between the timestamps produced on two hosts) becomes negative. In YARN-2251, we tried to mask the negative elapsed time. However, we should seek for a decent long term solution, such as providing mechanism to do and check clock synchronization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2327) A hadoop cluster needs clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe moved HADOOP-10794 to YARN-2327: - Key: YARN-2327 (was: HADOOP-10794) Project: Hadoop YARN (was: Hadoop Common) A hadoop cluster needs clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen As a distributed system, a hadoop cluster wants the clock on all the participating hosts synchronized. Otherwise, some problems might happen. For example, in YARN-2251, due to the clock on the host for the task container falls behind that on the host of the AM container, the computed elapsed time (the diff between the timestamps produced on two hosts) becomes negative. In YARN-2251, we tried to mask the negative elapsed time. However, we should seek for a decent long term solution, such as providing mechanism to do and check clock synchronization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2327) YARN should warn about nodes with poor clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-2327: --- Description: YARN should warn about nodes with poor clock synchronization. YARN relies on approximate clock synchronization to report certain elapsed time statistics (see YARN-2251), but we currently don't warn if this assumption is violated. was: YARN should warn about nodes with poor clock synchronization. YARN statistics rely on approximate clock synchronization in a few cases (see YARN-2251), but we currently don't warn if this assumption is violated. YARN should warn about nodes with poor clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen YARN should warn about nodes with poor clock synchronization. YARN relies on approximate clock synchronization to report certain elapsed time statistics (see YARN-2251), but we currently don't warn if this assumption is violated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2327) YARN should warn about nodes with poor clock synchronization
[ https://issues.apache.org/jira/browse/YARN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-2327: --- Description: YARN should warn about nodes with poor clock synchronization. YARN statistics rely on approximate clock synchronization in a few cases (see YARN-2251), but we currently don't warn if this assumption is violated. was: As a distributed system, a hadoop cluster wants the clock on all the participating hosts synchronized. Otherwise, some problems might happen. For example, in YARN-2251, due to the clock on the host for the task container falls behind that on the host of the AM container, the computed elapsed time (the diff between the timestamps produced on two hosts) becomes negative. In YARN-2251, we tried to mask the negative elapsed time. However, we should seek for a decent long term solution, such as providing mechanism to do and check clock synchronization. YARN should warn about nodes with poor clock synchronization Key: YARN-2327 URL: https://issues.apache.org/jira/browse/YARN-2327 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen YARN should warn about nodes with poor clock synchronization. YARN statistics rely on approximate clock synchronization in a few cases (see YARN-2251), but we currently don't warn if this assumption is violated. -- This message was sent by Atlassian JIRA (v6.2#6252)
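Editor's note: a rough sketch of the kind of warning YARN-2327 proposes; the threshold, names, and the idea of carrying a node-reported timestamp in the heartbeat are assumptions for illustration. If a node reports its local time, the RM can compare it against its own clock and warn when the apparent skew exceeds some tolerance; network latency makes this only an approximation, which is why it is a warning rather than an error.
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ClockSkewCheck {
  private static final Logger LOG = LoggerFactory.getLogger(ClockSkewCheck.class);

  private static final long MAX_TOLERATED_SKEW_MS = 30_000; // illustrative threshold

  /** Compares a node-reported timestamp against the local clock and warns on large skew. */
  static void checkSkew(String nodeId, long nodeReportedTimeMs) {
    long skew = Math.abs(System.currentTimeMillis() - nodeReportedTimeMs);
    if (skew > MAX_TOLERATED_SKEW_MS) {
      LOG.warn("Node " + nodeId + " clock appears to be off by ~" + skew
          + " ms; elapsed-time statistics involving this node may be misleading.");
    }
  }
}
{code}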
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069040#comment-14069040 ] Hadoop QA commented on YARN-2273: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656909/YARN-2273.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4384//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4384//console This message is automatically generated. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069077#comment-14069077 ] Karthik Kambatla commented on YARN-2273: Still missing a return in this block: {code} } catch (InterruptedException e) { LOG.error("Continuous scheduling thread interrupted. Exiting.", e); } {code} Unrelated, I think ContinuousSchedulingThread should be a separate class like UpdateThread. Both should extend Thread and be singleton classes. We can address this in another JIRA. In that JIRA, we should also add a test to make sure FairScheduler#stop stops both the threads. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2328) FairScheduler: Update and Continuous-Scheduling threads should be singletons
Karthik Kambatla created YARN-2328: -- Summary: FairScheduler: Update and Continuous-Scheduling threads should be singletons Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and be singletons. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069085#comment-14069085 ] Karthik Kambatla commented on YARN-2273: Filed YARN-2328 for the latter comment. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0, 2.4.1 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton Attachments: YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069092#comment-14069092 ] Karthik Kambatla commented on YARN-2131: +1 to the addendum. I'll commit this later today if no one objects. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-671) Add an interface on the RM to move NMs into a maintenance state
[ https://issues.apache.org/jira/browse/YARN-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069124#comment-14069124 ] Jian Fang commented on YARN-671: YARN-796 is adding label support for NMs. Could label-based scheduling be used to ask the RM not to assign any new tasks to NMs that are about to be decommissioned, giving them some grace time to finish decommissioning? Add an interface on the RM to move NMs into a maintenance state --- Key: YARN-671 URL: https://issues.apache.org/jira/browse/YARN-671 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069175#comment-14069175 ] Jian Fang commented on YARN-796: * RM has list of valid labels. (hot reloadable) This requires the RM to have a global picture of the cluster before it starts, which is unlikely to be true in our use case, where we provide Hadoop as a cloud platform and the RM does not have any information about the slave nodes until they join the cluster. Why not just treat all labels registered by NMs as valid ones? Label validation could apply only to resource requests. * label-expressions support (AND) only At least in our use case, OR is often used, not AND. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
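To illustrate the AND vs. OR point, here is a hypothetical evaluation of a flat label expression against a node's label set; this is not from the YARN-796 design documents, just a small worked example of the semantics being discussed:
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: evaluating a flat label expression such as
// "GPU&&LARGE_MEM" or "GPU||FPGA" against the labels reported for a node.
public class LabelExpressionExample {
  static boolean matches(String expression, Set<String> nodeLabels) {
    if (expression == null || expression.isEmpty()) {
      return true;                                    // no constraint
    }
    if (expression.contains("||")) {                  // OR: any listed label is enough
      for (String label : expression.split("\\|\\|")) {
        if (nodeLabels.contains(label.trim())) {
          return true;
        }
      }
      return false;
    }
    for (String label : expression.split("&&")) {     // AND: all listed labels required
      if (!nodeLabels.contains(label.trim())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> labels = new HashSet<>(Arrays.asList("GPU", "LARGE_MEM"));
    System.out.println(matches("GPU&&LARGE_MEM", labels)); // true
    System.out.println(matches("GPU||FPGA", labels));      // true
    System.out.println(matches("FPGA&&GPU", labels));      // false
  }
}
{code}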
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.6.patch RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2321: - Summary: NodeManager web UI can incorrectly report Pmem enforcement (was: NodeManager WebUI get wrong configuration of isPmemCheckEnabled()) +1 lgtm. Committing this. NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Attachments: YARN-2321.patch The NodeManager web UI reports the wrong value for whether Pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2329) Machine List generated by machines.jsp should be sorted
[ https://issues.apache.org/jira/browse/YARN-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-498 to YARN-2329: -- Key: YARN-2329 (was: MAPREDUCE-498) Project: Hadoop YARN (was: Hadoop Map/Reduce) Machine List generated by machines.jsp should be sorted --- Key: YARN-2329 URL: https://issues.apache.org/jira/browse/YARN-2329 Project: Hadoop YARN Issue Type: Improvement Reporter: Tim Williamson Priority: Minor Attachments: HADOOP-5586.patch The listing of machines shown by machines.jsp is arbitrarily ordered. It would be more useful to sort them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069426#comment-14069426 ] Hudson commented on YARN-2321: -- FAILURE: Integrated in Hadoop-trunk-Commit #5927 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5927/]) YARN-2321. NodeManager web UI can incorrectly report Pmem enforcement. Contributed by Leitao Guo (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612411) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/NodePage.java NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch The NodeManager web UI reports the wrong value for whether Pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
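For context, the kind of configuration lookup the NodePage fix concerns looks roughly like the snippet below; the YarnConfiguration keys are the standard ones, but this is not the literal patched code:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hedged illustration: the web UI should report the physical-memory check flag
// read with its own key and default, not some other setting.
public class PmemCheckExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    boolean pmemCheckEnabled = conf.getBoolean(
        YarnConfiguration.NM_PMEM_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
    System.out.println("Pmem enforcement enabled: " + pmemCheckEnabled);
  }
}
{code}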
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069453#comment-14069453 ] Hadoop QA commented on YARN-2211: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656964/YARN-2211.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4385//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4385//console This message is automatically generated. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1726: --- Priority: Blocker (was: Minor) Target Version/s: 2.5.0 Affects Version/s: 2.4.1 Given the scheduler-load-simulator is broken without this, I am making it a blocker for 2.5. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
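The change the issue description asks for is essentially a supertype swap; a directional sketch follows (not the actual patch, and omitting any generic type parameters AbstractYarnScheduler may carry on a given branch):
{code}
// Before: the simulator wrapper implements the scheduler interface directly.
//   public class ResourceSchedulerWrapper implements ResourceScheduler { ... }
//
// After: it extends the common base class introduced by YARN-1041, inheriting the
// shared bookkeeping the built-in schedulers now rely on.
//   public class ResourceSchedulerWrapper extends AbstractYarnScheduler { ... }
{code}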
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069522#comment-14069522 ] Hadoop QA commented on YARN-1726: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12638118/YARN-1726.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4386//console This message is automatically generated. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.6.1.patch fix the testcase failures RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.1.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: YARN-2295-072114.patch Thanks [~jianhe]! I've addressed the problem in the latest patch. If you've got time please take a look at it. Thanks! Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch, YARN-2295-072114.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
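As a generic example of the "public stable API" direction described in this issue (not a fragment of the attached patch), client code is expected to go through YarnClient rather than private or unstable records and protocols:
{code}
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Generic illustration of using the public, stable YarnClient API.
public class YarnClientExample {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      System.out.println("Applications visible: " + yarnClient.getApplications().size());
    } finally {
      yarnClient.stop();
    }
  }
}
{code}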
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069588#comment-14069588 ] Hadoop QA commented on YARN-2295: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657002/YARN-2295-072114.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4387//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4387//console This message is automatically generated. Refactor YARN distributed shell with existing public stable API --- Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, YARN-2295-071514.patch, YARN-2295-072114.patch Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069593#comment-14069593 ] Junping Du commented on YARN-2316: -- +1 on resolving it as a duplicate. It looks like these failures are all for the same reason. TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From the Jenkins tests in YARN-2045 and YARN-1341, these tests fail because the address is already bound. A similar issue happens with RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
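The bind failures mentioned above are a common symptom of tests reusing a fixed port; as background (this is not the YARN-2316 resolution, which was closed as a duplicate), one general remedy is to bind to port 0 so the OS assigns a free port:
{code}
import java.io.IOException;
import java.net.ServerSocket;

// Background illustration: binding to port 0 asks the OS for any free port,
// which avoids "address already in use" collisions between tests in one JVM.
public class EphemeralPortExample {
  public static void main(String[] args) throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      System.out.println("Bound test socket on free port " + socket.getLocalPort());
    }
  }
}
{code}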
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069615#comment-14069615 ] Wangda Tan commented on YARN-796: - Hi Tucu, Thanks for sharing your thoughts on how to stage the development work. That's reasonable, and we're trying to scope the work for a first iteration as well. Will keep you posted. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069619#comment-14069619 ] Wangda Tan commented on YARN-796: - Jian Fang, I think it makes sense for the RM to have a global picture, because it lets us prevent typos introduced by admins manually filling in labels in the NM config, etc. On the other hand, your use case is also reasonable; we should support both of them, as well as OR label expressions. Will keep you posted when we have made a plan. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069644#comment-14069644 ] Hadoop QA commented on YARN-2211: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657000/YARN-2211.6.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4388//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4388//console This message is automatically generated. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.1.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069651#comment-14069651 ] Wangda Tan commented on YARN-1198: -- I agree with [~jlowe], [~airbots] and [~cwelch]: used resources should be taken into account in the headroom calculation (which is YARN-2008). And of course, an application master can still ask for more than that number and possibly get more resources. I completely agree with what Jason mentioned; ignoring headroom will not cause problems for anything except the application itself. What I originally wanted to say is that putting headroom and gang scheduling together can cause deadlocks, which should be solved on the scheduler side. But that seems off-topic, so let's set it aside here. Also, as Chen mentioned, we don't need to consider preemption when computing headroom. Besides, when resources are about to be preempted from an app, the AM will receive preemption request messages and should handle them itself. Thanks, Wangda Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially a lot of situations which are not considered in this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These are all potential bugs in headroom calculation -- This message was sent by Atlassian JIRA (v6.2#6252)
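To make the "consider used resources" point concrete, a deliberately simplified headroom computation follows; it is not the CapacityScheduler's actual formula, just the shape of the idea under discussion:
{code}
// Simplified illustration of the headroom idea discussed above (not the exact
// CapacityScheduler formula): an app's headroom is bounded by its user limit and
// by what is actually still free in the queue, minus what the user already holds.
public class HeadroomExample {
  static int headroom(int userLimit, int queueMax, int queueUsed, int userUsed) {
    int queueAvailable = queueMax - queueUsed;          // what the queue can still hand out
    int remaining = Math.min(userLimit - userUsed,      // user-limit share not yet consumed
                             queueAvailable);
    return Math.max(remaining, 0);                      // never report negative headroom
  }

  public static void main(String[] args) {
    // user limit 40 GB, queue max 100 GB, queue already using 70 GB, this user using 30 GB
    System.out.println(headroom(40, 100, 70, 30) + " GB");   // prints 10 GB
  }
}
{code}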
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069662#comment-14069662 ] Xuan Gong commented on YARN-2211: - The test case failures are unrelated. All the test cases pass on my local machine. RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, YARN-2211.6.1.patch, YARN-2211.6.patch After YARN-2208, AMRMToken can be rolled over periodically. We need to save related Master Keys and use them to recover the AMRMToken when RM restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2321) NodeManager web UI can incorrectly report Pmem enforcement
[ https://issues.apache.org/jira/browse/YARN-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069661#comment-14069661 ] Leitao Guo commented on YARN-2321: -- Thanks Jason Lowe! NodeManager web UI can incorrectly report Pmem enforcement -- Key: YARN-2321 URL: https://issues.apache.org/jira/browse/YARN-2321 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Assignee: Leitao Guo Fix For: 3.0.0, 2.6.0 Attachments: YARN-2321.patch The NodeManager web UI reports the wrong value for whether Pmem enforcement is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2131. Resolution: Fixed Thanks Robert. Committed addendum to trunk and branch-2. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069673#comment-14069673 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-trunk-Commit #5930 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5930/]) YARN-2131. Addendum. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612443) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1726: -- Attachment: YARN-1726-5.patch Rebase the patch. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726-5.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069765#comment-14069765 ] Tsuyoshi OZAWA commented on YARN-2013: -- Thanks for your review, [~djp]! The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics are not of much use for users trying to analyze the reason for a container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069768#comment-14069768 ] Hudson commented on YARN-2013: -- FAILURE: Integrated in Hadoop-trunk-Commit #5931 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5931/]) YARN-2013. The diagnostics is always the ExitCodeException stack when the container crashes. (Contributed by Tsuyoshi OZAWA) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612449) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch, YARN-2013.4.patch, YARN-2013.5.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics are not of much use for users trying to analyze the reason for a container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
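For illustration only (this is not the committed YARN-2013 change), a diagnostics message that leads with the exit code and the container's own shell output is far more useful than the bare ExitCodeException stack; all names below are hypothetical:
{code}
// Illustrative only: build a diagnostics string around the exit code and the
// container's shell output rather than the exception stack alone.
public class DiagnosticsExample {
  static String buildDiagnostics(int exitCode, String shellOutput, Throwable cause) {
    StringBuilder sb = new StringBuilder();
    sb.append("Exception from container-launch.\n");
    sb.append("Container exited with a non-zero exit code ").append(exitCode).append("\n");
    if (shellOutput != null && !shellOutput.isEmpty()) {
      sb.append("Shell output: ").append(shellOutput).append("\n");
    }
    if (cause != null) {
      sb.append("Stack trace: ").append(cause).append("\n");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildDiagnostics(1, "No such file or directory", null));
  }
}
{code}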
[jira] [Updated] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2319: - Attachment: YARN-2319.1.patch Thanks [~zjshen] for the great comments. The new patch addresses [~zjshen]'s comment by refactoring setupKDC into a @BeforeClass method. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
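A sketch of the lifecycle the patch moves towards is below: start the KDC once per test class and always stop it afterwards so the process and its port are released. Field and directory names are illustrative, not copied from the test:
{code}
import java.io.File;
import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.AfterClass;
import org.junit.BeforeClass;

// Illustrative lifecycle sketch: pair every MiniKdc.start() with a stop().
public class MiniKdcLifecycleSketch {
  private static MiniKdc testMiniKDC;

  @BeforeClass
  public static void setupKDC() throws Exception {
    testMiniKDC = new MiniKdc(MiniKdc.createConf(), new File("target/test-kdc"));
    testMiniKDC.start();
  }

  @AfterClass
  public static void shutdownKDC() {
    if (testMiniKDC != null) {
      testMiniKDC.stop();   // the missing cleanup the JIRA is about
    }
  }
}
{code}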
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069782#comment-14069782 ] Hadoop QA commented on YARN-1726: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657041/YARN-1726-5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4390//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4390//console This message is automatically generated. ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041 Key: YARN-1726 URL: https://issues.apache.org/jira/browse/YARN-1726 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Wei Yan Assignee: Wei Yan Priority: Blocker Attachments: YARN-1726-5.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069783#comment-14069783 ] Hadoop QA commented on YARN-2242: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655592/YARN-2242-071414.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4389//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4389//console This message is automatically generated. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, YARN-2242-071414.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069814#comment-14069814 ] Zhijie Shen commented on YARN-2270: --- +1, LGTM as well. The test failure was gone locally with this patch. TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2330) Jobs are not displaying in timeline server after RM restart
Nishan Shetty created YARN-2330: --- Summary: Jobs are not displaying in timeline server after RM restart Key: YARN-2330 URL: https://issues.apache.org/jira/browse/YARN-2330 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.1 Environment: Nodemanagers 3 (3*8GB) Queues A = 70% Queues B = 30% Reporter: Nishan Shetty Submit jobs to queue a While job is running Restart RM Observe that those jobs are not displayed in timelineserver {code} 2014-07-22 10:11:32,084 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: History information of application application_1406002968974_0003 is not included into the result due to the exception java.io.IOException: Cannot seek to negative offset at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1381) at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63) at org.apache.hadoop.io.file.tfile.BCFile$Reader.init(BCFile.java:624) at org.apache.hadoop.io.file.tfile.TFile$Reader.init(TFile.java:804) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileReader.init(FileSystemApplicationHistoryStore.java:683) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileReader(FileSystemApplicationHistoryStore.java:661) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getApplication(FileSystemApplicationHistoryStore.java:146) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getAllApplications(FileSystemApplicationHistoryStore.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getAllApplications(ApplicationHistoryManagerImpl.java:103) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:75) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Dispatcher.render(Dispatcher.java:197) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:156) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
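One plausible reading of the stack trace, offered purely as a hypothesis rather than a proposed fix, is that the negative-seek error comes from opening an empty or truncated history file left behind by the restart; a defensive check along these lines would skip such files before handing them to the TFile reader:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothesis sketch only, not a patch: an empty file cannot be a valid TFile,
// so it could be skipped instead of triggering the negative-offset seek.
public class HistoryFileGuardSketch {
  static boolean looksReadable(FileSystem fs, Path historyFile) throws IOException {
    FileStatus status = fs.getFileStatus(historyFile);
    return status.getLen() > 0;
  }
}
{code}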
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069823#comment-14069823 ] Hadoop QA commented on YARN-2319: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657046/YARN-2319.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4391//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4391//console This message is automatically generated. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069831#comment-14069831 ] Hudson commented on YARN-2270: -- FAILURE: Integrated in Hadoop-trunk-Commit #5932 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5932/]) YARN-2270. Made TestFSDownload#testDownloadPublicWithStatCache be skipped when there’s no ancestor permissions. Contributed by Akira Ajisaka. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1612460) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069832#comment-14069832 ] Zhijie Shen commented on YARN-2270: --- Committed to trunk, branch-2, and branch-2.5. Thanks [~ajisakaa] for the patch, and [~vvasudev] for the review! TestFSDownload#testDownloadPublicWithStatCache fails in trunk - Key: YARN-2270 URL: https://issues.apache.org/jira/browse/YARN-2270 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.1 Reporter: Ted Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.5.0 Attachments: YARN-2270.2.patch, YARN-2270.patch From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : {code} Running org.apache.hadoop.yarn.util.TestFSDownload Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) {code} Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069858#comment-14069858 ] Hadoop QA commented on YARN-2319: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657046/YARN-2319.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4392//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4392//console This message is automatically generated. Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java --- Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-2319.0.patch, YARN-2319.1.patch MiniKdc only invoke start method not stop in TestRMWebServicesDelegationTokens.java {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)