[jira] [Updated] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1197: - Assignee: (was: Wangda Tan) Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes the resource allocated to a container is fixed for its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container of the expected size. Allowing run-time resource changes to an allocated container would give applications better control of their resource usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934673#comment-13934673 ] Wangda Tan commented on YARN-1197: -- I'm leaving my current company next week and will no longer be involved in YARN-1197; one of my colleagues will take over this JIRA and its sub-tasks. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes the resource allocated to a container is fixed for its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container of the expected size. Allowing run-time resource changes to an allocated container would give applications better control of their resource usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1609) Add Service Container type to NodeManager in YARN
[ https://issues.apache.org/jira/browse/YARN-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1609: - Assignee: (was: Wangda Tan) Add Service Container type to NodeManager in YARN - Key: YARN-1609 URL: https://issues.apache.org/jira/browse/YARN-1609 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Wangda Tan Attachments: Add Service Container type to NodeManager in YARN-V1.pdf From our work to support running OpenMPI on YARN (MAPREDUCE-2911), we found that it’s important to have a framework-specific daemon process manage the tasks on each node directly. The daemon process, likely similar in other frameworks as well, provides critical services to tasks running on that node (for example “wireup”, spawning user processes in large numbers at once, etc.). In YARN, it’s hard, if not impossible, to have those processes managed by YARN. We propose to extend the container model on the NodeManager side with a “Service Container” type to run and manage such framework daemon/service processes. We believe this would be very useful to other application framework developers as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1609) Add Service Container type to NodeManager in YARN
[ https://issues.apache.org/jira/browse/YARN-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934675#comment-13934675 ] Wangda Tan commented on YARN-1609: -- I'm leaving my current company next week and will no longer be involved in YARN-1609; one of my colleagues will take over this JIRA. Add Service Container type to NodeManager in YARN - Key: YARN-1609 URL: https://issues.apache.org/jira/browse/YARN-1609 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Wangda Tan Attachments: Add Service Container type to NodeManager in YARN-V1.pdf From our work to support running OpenMPI on YARN (MAPREDUCE-2911), we found that it’s important to have a framework-specific daemon process manage the tasks on each node directly. The daemon process, likely similar in other frameworks as well, provides critical services to tasks running on that node (for example “wireup”, spawning user processes in large numbers at once, etc.). In YARN, it’s hard, if not impossible, to have those processes managed by YARN. We propose to extend the container model on the NodeManager side with a “Service Container” type to run and manage such framework daemon/service processes. We believe this would be very useful to other application framework developers as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1685) Bugs around log URL
[ https://issues.apache.org/jira/browse/YARN-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1685: -- Attachment: YARN-1685.5.patch Uploaded a new patch with the new approach. Bugs around log URL --- Key: YARN-1685 URL: https://issues.apache.org/jira/browse/YARN-1685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Attachments: YARN-1685-1.patch, YARN-1685.2.patch, YARN-1685.3.patch, YARN-1685.4.patch, YARN-1685.5.patch 1. Log URL should be different when the container is running and finished 2. Null case needs to be handled 3. The way of constructing log URL should be corrected -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1685) Bugs around log URL
[ https://issues.apache.org/jira/browse/YARN-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934679#comment-13934679 ] Zhijie Shen commented on YARN-1685: --- Vinod, good suggestion. We can make use of the existing info to construct a log URL string when rendering it. However, there's one issue: based on the currently stored container information, we're unable to know whether the container was ever launched. If the container was not launched, we should not show the log URL. Anyway, I agree that not storing the log URL is the right way. How about we fix the container-not-launched case separately, by enhancing the container exit status, state, or something similar to indicate what happened to the container? Bugs around log URL --- Key: YARN-1685 URL: https://issues.apache.org/jira/browse/YARN-1685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Attachments: YARN-1685-1.patch, YARN-1685.2.patch, YARN-1685.3.patch, YARN-1685.4.patch, YARN-1685.5.patch 1. Log URL should be different when the container is running and finished 2. Null case needs to be handled 3. The way of constructing log URL should be corrected -- This message was sent by Atlassian JIRA (v6.2#6252)
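For illustration, render-time construction of the URL might look like the following minimal sketch. All type and field names here are hypothetical stand-ins for whatever container information the history store already keeps; this is not the actual patch.
{code}
// Hypothetical sketch: build the log URL from stored fields at render
// time instead of persisting it. A running container links to the
// NodeManager's live log page; a finished one links to the log server.
String logUrl(StoredContainerInfo c, String logServerAddress) {
  if (c.isRunning()) {
    return "http://" + c.nodeHttpAddress() + "/node/containerlogs/"
        + c.containerId() + "/" + c.user();
  }
  return "http://" + logServerAddress + "/logs/" + c.nodeId() + "/"
      + c.containerId() + "/" + c.containerId() + "/" + c.user();
}
{code}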
[jira] [Commented] (YARN-1685) Bugs around log URL
[ https://issues.apache.org/jira/browse/YARN-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934707#comment-13934707 ] Hadoop QA commented on YARN-1685: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634661/YARN-1685.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3361//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3361//console This message is automatically generated. Bugs around log URL --- Key: YARN-1685 URL: https://issues.apache.org/jira/browse/YARN-1685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Attachments: YARN-1685-1.patch, YARN-1685.2.patch, YARN-1685.3.patch, YARN-1685.4.patch, YARN-1685.5.patch 1. Log URL should be different when the container is running and finished 2. Null case needs to be handled 3. The way of constructing log URL should be corrected -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1609) Add Service Container type to NodeManager in YARN
[ https://issues.apache.org/jira/browse/YARN-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang reassigned YARN-1609: Assignee: Jeff Zhang Add Service Container type to NodeManager in YARN - Key: YARN-1609 URL: https://issues.apache.org/jira/browse/YARN-1609 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Wangda Tan Assignee: Jeff Zhang Attachments: Add Service Container type to NodeManager in YARN-V1.pdf From our work to support running OpenMPI on YARN (MAPREDUCE-2911), we found that it’s important to have a framework-specific daemon process manage the tasks on each node directly. The daemon process, likely similar in other frameworks as well, provides critical services to tasks running on that node (for example “wireup”, spawning user processes in large numbers at once, etc.). In YARN, it’s hard, if not impossible, to have those processes managed by YARN. We propose to extend the container model on the NodeManager side with a “Service Container” type to run and manage such framework daemon/service processes. We believe this would be very useful to other application framework developers as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1832) wrong MockLocalizerStatus.equals() method implementation
Hong Zhiguo created YARN-1832: - Summary: wrong MockLocalizerStatus.equals() method implementation Key: YARN-1832 URL: https://issues.apache.org/jira/browse/YARN-1832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Hong Zhiguo Priority: Trivial return getLocalizerId().equals(other) ...; should be return getLocalizerId().equals(other.getLocalizerId()) ...; getLocalizerId() returns a String. It's expected to compare this.getLocalizerId() against other.getLocalizerId(). -- This message was sent by Atlassian JIRA (v6.2#6252)
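A minimal sketch of the corrected method is below; the surrounding identity/type checks are illustrative assumptions, not copied from the patch.
{code}
@Override
public boolean equals(Object obj) {
  if (this == obj) {
    return true;
  }
  if (!(obj instanceof MockLocalizerStatus)) {
    return false;
  }
  MockLocalizerStatus other = (MockLocalizerStatus) obj;
  // Compare this status' localizer id against the other status' id,
  // not against the other object itself: a String can never equal a
  // MockLocalizerStatus, so the old check always returned false.
  return getLocalizerId().equals(other.getLocalizerId());
}
{code}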
[jira] [Assigned] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang reassigned YARN-1197: Assignee: Jeff Zhang Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Assignee: Jeff Zhang Attachments: mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes the resource allocated to a container is fixed for its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container of the expected size. Allowing run-time resource changes to an allocated container would give applications better control of their resource usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [jira] [Assigned] (YARN-1197) Support changing resources of an allocated container
Hi Jeff, how can you assign an issue to yourself when you are not a contributor yet? I want to assign an issue too. Thanks, Zhiguo -- Original -- From: Jeff Zhang (JIRA) <j...@apache.org> Send time: Friday, Mar 14, 2014 4:59 PM To: yarn-issues <yarn-issues@hadoop.apache.org> Subject: [jira] [Assigned] (YARN-1197) Support changing resources of an allocated container [ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang reassigned YARN-1197: Assignee: Jeff Zhang Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Assignee: Jeff Zhang Attachments: mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1 The current YARN resource management logic assumes the resource allocated to a container is fixed for its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container of the expected size. Allowing run-time resource changes to an allocated container would give applications better control of their resource usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1832) wrong MockLocalizerStatus.equals() method implementation
[ https://issues.apache.org/jira/browse/YARN-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-1832: -- Attachment: YARN-1832.patch wrong MockLocalizerStatus.equals() method implementation Key: YARN-1832 URL: https://issues.apache.org/jira/browse/YARN-1832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Hong Zhiguo Priority: Trivial Attachments: YARN-1832.patch return getLocalizerId().equals(other) ...; should be return getLocalizerId().equals(other.getLocalizerId()) ...; getLocalizerId() returns a String. It's expected to compare this.getLocalizerId() against other.getLocalizerId(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1832) wrong MockLocalizerStatus.equals() method implementation
[ https://issues.apache.org/jira/browse/YARN-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934816#comment-13934816 ] Hadoop QA commented on YARN-1832: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634678/YARN-1832.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3362//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3362//console This message is automatically generated. wrong MockLocalizerStatus.equals() method implementation Key: YARN-1832 URL: https://issues.apache.org/jira/browse/YARN-1832 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Hong Zhiguo Priority: Trivial Attachments: YARN-1832.patch return getLocalizerId().equals(other) ...; should be return getLocalizerId().equals(other.getLocalizerId()) ...; getLocalizerId() returns a String. It's expected to compare this.getLocalizerId() against other.getLocalizerId(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated YARN-1775: --- Attachment: yarn-1775-2.4.0.patch Computes the RSS by reading /proc/pid/smaps. Tested with branch 2.4.0 on a 20-node cluster. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Priority: Minor Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
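For readers unfamiliar with smaps: each mapping in /proc/<pid>/smaps carries a "Pss:" field (in kB), and summing those fields gives the process's proportional set size. A minimal, self-contained sketch of that computation (illustrative only, not the patch itself):
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SmapsPssReader {
  // Sums the "Pss:" fields (reported in kB) across all mappings in
  // /proc/<pid>/smaps, giving the proportional set size of the process.
  public static long getPssKb(String pid) throws IOException {
    long pssKb = 0;
    try (BufferedReader reader =
        new BufferedReader(new FileReader("/proc/" + pid + "/smaps"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("Pss:")) {
          // Line format: "Pss:        1234 kB"
          String[] parts = line.trim().split("\\s+");
          pssKb += Long.parseLong(parts[1]);
        }
      }
    }
    return pssKb;
  }
}
{code}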
[jira] [Commented] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934829#comment-13934829 ] Rajesh Balamohan commented on YARN-1775: Review request link : https://reviews.apache.org/r/19220/ Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Priority: Minor Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated YARN-1775: --- Fix Version/s: 2.5.0 Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Priority: Minor Fix For: 2.5.0 Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1658) Webservice should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934873#comment-13934873 ] Hudson commented on YARN-1658: -- FAILURE: Integrated in Hadoop-Yarn-trunk #509 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/509/]) YARN-1658. Modified web-app framework to let standby RMs redirect web-service calls to the active RM. Contributed by Cindy Li. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1577408) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Dispatcher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Router.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/WebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMDispatcher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java Webservice should redirect to active RM when HA is enabled. --- Key: YARN-1658 URL: https://issues.apache.org/jira/browse/YARN-1658 Project: Hadoop YARN Issue Type: Sub-task Reporter: Cindy Li Assignee: Cindy Li Labels: YARN Fix For: 2.4.0 Attachments: YARN1658.1.patch, YARN1658.2.patch, YARN1658.3.patch, YARN1658.patch When HA is enabled, web service to standby RM should be redirected to the active RM. This is a related Jira to YARN-1525. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934867#comment-13934867 ] Hudson commented on YARN-1771: -- FAILURE: Integrated in Hadoop-Yarn-trunk #509 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/509/]) YARN-1771. Reduce the number of NameNode operations during localization of public resources using a cache. Contributed by Sangjin Lee (cdouglas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1577391) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestLocalDistributedCacheManager.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizerContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Fix For: 3.0.0, 2.4.0 Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, yarn-1771.patch We're observing that getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness of a resource that belongs in the public cache during localization. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
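The committed change caches these lookups inside FSDownload; as a rough illustration of the general idea (the class and method names below are hypothetical, not the actual change):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: cache the result of an expensive per-path check so
// repeated localizations of the same resource hit the NameNode once.
public class PublicnessCache {
  private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

  public boolean isPublic(String path) {
    return cache.computeIfAbsent(path, this::checkAncestorPermissions);
  }

  // Placeholder for the real check, which walks the path's ancestors
  // calling getFileStatus() on each to verify world-readability.
  private boolean checkAncestorPermissions(String path) {
    return true;
  }
}
{code}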
[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1813: - Attachment: YARN-1813.2.patch Included tests. Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Andrew Wang Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
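A rough sketch of the kind of distinction being proposed, where dumpAggregatedLogs is a hypothetical placeholder for the CLI's log-fetching call rather than a real method in the patch:
{code}
import java.io.FileNotFoundException;
import org.apache.hadoop.security.AccessControlException;

// Illustrative only: report a permissions failure distinctly instead of
// printing the generic "log aggregation" message for every failure mode.
void printLogs(String appId) throws Exception {
  try {
    dumpAggregatedLogs(appId);   // hypothetical log-fetching helper
  } catch (AccessControlException e) {
    System.err.println("Permission denied: " + e.getMessage());
  } catch (FileNotFoundException e) {
    System.err.println("Logs not available for " + appId
        + ". Log aggregation has not completed or is not enabled.");
  }
}
{code}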
[jira] [Commented] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934937#comment-13934937 ] Hadoop QA commented on YARN-1813: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634698/YARN-1813.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3363//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3363//console This message is automatically generated. Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Andrew Wang Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934960#comment-13934960 ] Tsuyoshi OZAWA commented on YARN-1813: -- [~andrew.wang], can you review the latest patch? Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Andrew Wang Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations to handle requests for large containers when there might not be enough space available on a single host. The current algorithm is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can miss nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity; if you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve on both by simply continuing to look at incoming nodes to see if we could swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
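In pseudocode, the proposed improvement amounts to something like the sketch below; the helper methods (fits, findReservationFor, unreserve, allocate) are hypothetical, not the actual scheduler code.
{code}
// Illustrative pseudocode: instead of stopping once the reservation
// limit is reached, keep evaluating node heartbeats and trade an
// outstanding reservation for a real allocation when a node with
// enough free space shows up.
void onNodeUpdate(SchedulerNode node, ResourceRequest request) {
  if (fits(request.getCapability(), node.getAvailableResource())) {
    RMContainer reserved = findReservationFor(request);
    if (reserved != null) {
      unreserve(reserved);        // release the reserved capacity
    }
    allocate(node, request);      // satisfy the request immediately
  }
}
{code}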
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935021#comment-13935021 ] Hudson commented on YARN-1771: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1701 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1701/]) YARN-1771. Reduce the number of NameNode operations during localization of public resources using a cache. Contributed by Sangjin Lee (cdouglas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1577391) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestLocalDistributedCacheManager.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizerContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Fix For: 3.0.0, 2.4.0 Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, yarn-1771.patch We're observing that getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness of a resource that belongs in the public cache during localization. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1658) Webservice should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935027#comment-13935027 ] Hudson commented on YARN-1658: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1701 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1701/]) YARN-1658. Modified web-app framework to let standby RMs redirect web-service calls to the active RM. Contributed by Cindy Li. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1577408) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Dispatcher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Router.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/WebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMDispatcher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java Webservice should redirect to active RM when HA is enabled. --- Key: YARN-1658 URL: https://issues.apache.org/jira/browse/YARN-1658 Project: Hadoop YARN Issue Type: Sub-task Reporter: Cindy Li Assignee: Cindy Li Labels: YARN Fix For: 2.4.0 Attachments: YARN1658.1.patch, YARN1658.2.patch, YARN1658.3.patch, YARN1658.patch When HA is enabled, web service to standby RM should be redirected to the active RM. This is a related Jira to YARN-1525. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations to handle requests for large containers when there might not be enough space available on a single host. The current algorithm is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can miss nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity; if you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve on both by simply continuing to look at incoming nodes to see if we could swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935038#comment-13935038 ] Hadoop QA commented on YARN-1769: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634711/YARN-1769.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3364//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3364//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3364//console This message is automatically generated. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations to handle requests for large containers when there might not be enough space available on a single host. The current algorithm is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can miss nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity; if you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve on both by simply continuing to look at incoming nodes to see if we could swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-1717: - Attachment: YARN-1717.11.patch I renamed the thread EntityDeletionThread and logged a warning when interrupted. I also found a couple of missing checks for null and addressed those. Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
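As a rough sketch of the thread's shape described above — the store handle and its method are hypothetical; the real patch operates directly on the leveldb key layout:
{code}
// Illustrative only: a background thread that periodically discards
// entities whose start time falls outside a retention window.
class EntityDeletionThread extends Thread {
  private final TimelineStoreHandle store;   // hypothetical handle
  private final long ttlMillis;
  private final long intervalMillis;

  EntityDeletionThread(TimelineStoreHandle store, long ttlMillis,
      long intervalMillis) {
    this.store = store;
    this.ttlMillis = ttlMillis;
    this.intervalMillis = intervalMillis;
    setDaemon(true);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        store.discardEntitiesOlderThan(System.currentTimeMillis() - ttlMillis);
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        // Match the behaviour described above: warn and stop.
        System.err.println("EntityDeletionThread interrupted, exiting");
        return;
      }
    }
  }
}
{code}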
[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935117#comment-13935117 ] Hadoop QA commented on YARN-1717: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634721/YARN-1717.11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3366//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3366//console This message is automatically generated. Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1658) Webservice should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935106#comment-13935106 ] Hudson commented on YARN-1658: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1726 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1726/]) YARN-1658. Modified web-app framework to let standby RMs redirect web-service calls to the active RM. Contributed by Cindy Li. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1577408) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Dispatcher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/Router.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/WebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMDispatcher.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java Webservice should redirect to active RM when HA is enabled. --- Key: YARN-1658 URL: https://issues.apache.org/jira/browse/YARN-1658 Project: Hadoop YARN Issue Type: Sub-task Reporter: Cindy Li Assignee: Cindy Li Labels: YARN Fix For: 2.4.0 Attachments: YARN1658.1.patch, YARN1658.2.patch, YARN1658.3.patch, YARN1658.patch When HA is enabled, web service to standby RM should be redirected to the active RM. This is a related Jira to YARN-1525. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935100#comment-13935100 ] Hudson commented on YARN-1771: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1726 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1726/]) YARN-1771. Reduce the number of NameNode operations during localization of public resources using a cache. Contributed by Sangjin Lee (cdouglas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1577391) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestLocalDistributedCacheManager.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizerContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Fix For: 3.0.0, 2.4.0 Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, yarn-1771.patch We're observing that getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness of a resource that belongs in the public cache during localization. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1833) TestRMAdminService Fails in branch-2
Mit Desai created YARN-1833: --- Summary: TestRMAdminService Fails in branch-2 Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Mit Desai In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails whenever groupWithInit and groupBefore happen to have the same size. I do not think we need this assert here. Moreover, we are also checking that groupWithInit does not contain the user groups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
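Illustratively, the check that still matters after dropping the size comparison is the membership one, along these lines (an illustrative sketch, not the actual test code):
{code}
// The environment-dependent size assert is dropped; the refresh is
// still verified because none of the user's pre-refresh groups may
// appear in the initial group list.
for (String group : groupBefore) {
  Assert.assertFalse(groupWithInit.contains(group));
}
{code}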
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935095#comment-13935095 ] Hadoop QA commented on YARN-1769: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634719/YARN-1769.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3365//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3365//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3365//console This message is automatically generated. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations to handle requests for large containers when there might not be enough space available on a single host. The current algorithm is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can miss nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity; if you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve on both by simply continuing to look at incoming nodes to see if we could swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935128#comment-13935128 ] Billie Rinaldi commented on YARN-1717: -- bq. We do deletion according to entity's TS and at the entity's granularity, thus, the events that are still alive are likely to be deleted as well. I believe this is the desired behavior. For example, in the case where we have a job entity that starts several shorter-lived task entities, we would not want to remove task entities before the job entity is removed. With the current behavior, the job entity would be removed at the same time or earlier than the task entities. We don't yet have a good understanding of how applications with long-lived entities would want to use the timeline store, so it's hard to design for them. Perhaps an option for the future would be to have a configurable deletion strategy, if some applications have different requirements. Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) sending ATS events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935148#comment-13935148 ] Mayank Bansal commented on YARN-1690: - Thanks [~zjshen] for the review bq. 1. Catch Exception and merge the duplicate handling. Done bq. 2. Call it timeline client, and similar for the following code. Done bq. 3. Is the following the related change This was done due to a logging issue, so it's good to keep it. bq. 4. Don't create a new config, but use the existing one. It has to be created; there is no existing config bq. 5. Call it DS_CONTAINER? Do not confuse it with the generic information. Done bq. 6. Entity type is different from event type. Call it DS_APPLICATION_ATTEMPT? Done bq. 7. Event type is not set Done bq. 8. Correct STatus Done bq. 9. Can you add user as the primary filter? Done bq. 10. In general, it doesn't make sense to record the information that the generic history service has already captured, such as the other info for container. It's per-framework data, such that it's better to record some DS specific information. Changed the names bq. 11. Need more assertion. For example, test both container and attempt entities. Done bq. 12. Mark it @Private as well Done bq. 13. Correct comment? It seems you choose to set default AHS address, and don't understand why it is related to YARN_MINICLUSTER_FIXED_PORTS. Done sending ATS events from Distributed shell -- Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1690) sending ATS events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1690: Attachment: YARN-1690-4.patch Attaching latest patch Thanks, Mayank sending ATS events from Distributed shell -- Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) sending ATS events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935180#comment-13935180 ] Hadoop QA commented on YARN-1690: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634729/YARN-1690-4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3367//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3367//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-applications-distributedshell.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3367//console This message is automatically generated. sending ATS events from Distributed shell -- Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1685) Bugs around log URL
[ https://issues.apache.org/jira/browse/YARN-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935195#comment-13935195 ] Tsuyoshi OZAWA commented on YARN-1685: -- TestResourceTrackerService failure is tracked on YARN-1591. TestRMRestart failure is tracked on YARN-1830. Bugs around log URL --- Key: YARN-1685 URL: https://issues.apache.org/jira/browse/YARN-1685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Attachments: YARN-1685-1.patch, YARN-1685.2.patch, YARN-1685.3.patch, YARN-1685.4.patch, YARN-1685.5.patch 1. Log URL should be different when the container is running and finished 2. Null case needs to be handled 3. The way of constructing log URL should be corrected -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk
[ https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935205#comment-13935205 ] Tsuyoshi OZAWA commented on YARN-1591: -- The failure of TestRMRestart is unrelated and is tracked in YARN-1830. I ran TestResourceTrackerService hundreds of times last night, and the latest patch appears to work well. [~jianhe], can you take a look? TestResourceTrackerService fails randomly on trunk -- Key: YARN-1591 URL: https://issues.apache.org/jira/browse/YARN-1591 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Attachments: YARN-1591.1.patch, YARN-1591.2.patch, YARN-1591.3.patch, YARN-1591.3.patch, YARN-1591.5.patch, YARN-1591.6.patch As evidenced by Jenkins at https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621. It's failing randomly on trunk on my local box too -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1834) YarnClient will not be redirected to the history server when RM is down
Zhijie Shen created YARN-1834: - Summary: YarnClient will not be redirected to the history server when RM is down Key: YARN-1834 URL: https://issues.apache.org/jira/browse/YARN-1834 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhijie Shen When the RM is not available, the client will keep retrying the RM and will never reach the history server to get the app/attempt/container info. Therefore, during an RM restart, such a request will be blocked. However, it could move on if the history service is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1834) YarnClient will not be redirected to the history server when RM is down
[ https://issues.apache.org/jira/browse/YARN-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1834: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-321 YarnClient will not be redirected to the history server when RM is down --- Key: YARN-1834 URL: https://issues.apache.org/jira/browse/YARN-1834 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen When the RM is not available, the client will keep retrying the RM and will never reach the history server to get the app/attempt/container info. Therefore, during an RM restart, such a request will be blocked. However, it could move on if the history service is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1521) Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation
[ https://issues.apache.org/jira/browse/YARN-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935274#comment-13935274 ] Zhijie Shen commented on YARN-1521: --- Just a reminder: ApplicationClientProtocol has four more methods whose idempotency needs to be verified as well: 1. getApplicationAttemptReport 2. getApplicationAttempts 3. getContainerReport 4. getContainers Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation - Key: YARN-1521 URL: https://issues.apache.org/jira/browse/YARN-1521 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-1028, we added automatic failover to RMProxy. This JIRA is to identify whether we need to add the idempotent annotation and which methods can be marked as idempotent. -- This message was sent by Atlassian JIRA (v6.2#6252)
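If verification concludes these calls are safe to retry, the change would presumably mirror the existing annotations on ApplicationClientProtocol. A hedged sketch for one of the four methods (abbreviated; this is not a committed patch):
{code}
import java.io.IOException;
import org.apache.hadoop.io.retry.Idempotent;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationAttemptReportRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationAttemptReportResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

public interface ApplicationClientProtocol {
  // Read-only query: retrying it cannot change RM state, so @Idempotent is
  // the plausible outcome of the verification. The same reasoning would
  // apply to getApplicationAttempts, getContainerReport, and getContainers.
  @Idempotent
  GetApplicationAttemptReportResponse getApplicationAttemptReport(
      GetApplicationAttemptReportRequest request)
      throws YarnException, IOException;
}
{code}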
[jira] [Created] (YARN-1835) History client service needs to be more robust
Zhijie Shen created YARN-1835: - Summary: History client service needs to be more robust Key: YARN-1835 URL: https://issues.apache.org/jira/browse/YARN-1835 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen While testing, I've found the following issues so far: 1. The history-file-not-found exception is exposed to the user directly; it would be better to catch it and translate it into ApplicationNotFound. 2. An NPE will be exposed as well, since ApplicationHistoryManager doesn't do the necessary null checks. In addition, TestApplicationHistoryManagerImpl fails to test most ApplicationHistoryManager methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
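To make the two fixes concrete, a rough sketch of the hardening being described (historyManager, getApplication, ApplicationHistoryData, and convertToReport are illustrative names here, not the actual ApplicationHistoryManager API):
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;
import org.apache.hadoop.yarn.exceptions.YarnException;

ApplicationReport getApplication(ApplicationId appId)
    throws YarnException, IOException {
  ApplicationHistoryData data;
  try {
    data = historyManager.getApplication(appId);
  } catch (FileNotFoundException e) {
    // Issue 1: translate the storage-level error into a client-facing
    // ApplicationNotFound instead of leaking it to the user.
    throw new ApplicationNotFoundException("History of " + appId + " is not found");
  }
  if (data == null) {
    // Issue 2: null check so the caller sees ApplicationNotFound, not an NPE.
    throw new ApplicationNotFoundException("History of " + appId + " is not found");
  }
  return convertToReport(data);
}
{code}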
[jira] [Updated] (YARN-1835) History client service needs to be more robust
[ https://issues.apache.org/jira/browse/YARN-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1835: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 History client service needs to be more robust -- Key: YARN-1835 URL: https://issues.apache.org/jira/browse/YARN-1835 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen While testing, I've found the following issues so far: 1. The history-file-not-found exception is exposed to the user directly; it would be better to catch it and translate it into ApplicationNotFound. 2. An NPE will be exposed as well, since ApplicationHistoryManager doesn't do the necessary null checks. In addition, TestApplicationHistoryManagerImpl fails to test most ApplicationHistoryManager methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated YARN-1771: Fix Version/s: 2.5.0 many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935310#comment-13935310 ] Chris Douglas commented on YARN-1771: - bq. It would be great if you could commit this to branch-2.4 too... Sure, np. Done many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935313#comment-13935313 ] Akira AJISAKA commented on YARN-1833: - I understand the test will fail if testuser belongs to 3 (={{groupBefore.size()}}) groups. +1 for removing the assertion. In addition, I think the test will also fail in trunk. TestRMAdminService Fails in branch-2 Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Mit Desai In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935315#comment-13935315 ] Sangjin Lee commented on YARN-1771: --- Thanks! many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example: {noformat} 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ... 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/... 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... 2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1811) RM HA: AM link broken if the AM is on nodes other than RM
[ https://issues.apache.org/jira/browse/YARN-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935329#comment-13935329 ] Robert Kanter commented on YARN-1811: - Ok, I'll make it {{@Public}} and put back the old constants and behavior. The AmIpFilter is supposed to be created by the AmFilterInitializer, so I think we can make the new constants package-private so that only the Initializer uses them going forward, and mark the old constants as deprecated. I agree that it would be simpler to just redirect to any of the RMs and assume it auto-redirects to the active RM. However, this won't work if that RM is currently down, so I think we have to check for the active RM, which should be up. I'll look at RMHAUtils though. {{conf.getValByRegex()}} seemed simpler, but I see your point. If they have some invalid config properties that match, it will pick those up. RM HA: AM link broken if the AM is on nodes other than RM - Key: YARN-1811 URL: https://issues.apache.org/jira/browse/YARN-1811 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-1811.patch, YARN-1811.patch, YARN-1811.patch, YARN-1811.patch When using RM HA, if you click on the Application Master link in the RM web UI while the job is running, you get an Error 500: -- This message was sent by Atlassian JIRA (v6.2#6252)
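For reference, a rough sketch of the two lookup styles being discussed (conf is assumed to be a Hadoop Configuration; the keys shown are the standard RM HA ones, and the surrounding code is illustrative only):
{code}
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();

// Regex lookup: concise, but it also picks up any invalid or leftover
// property whose name happens to match the pattern.
Map<String, String> matched =
    conf.getValByRegex("yarn\\.resourcemanager\\.webapp\\.address.*");

// Iterating the configured RM IDs only touches keys that are actually
// part of the HA configuration.
for (String rmId :
    conf.getTrimmedStringCollection("yarn.resourcemanager.ha.rm-ids")) {
  String addr = conf.get("yarn.resourcemanager.webapp.address." + rmId);
  // probe addr (or ask RMHAUtils for the active RM) before redirecting
}
{code}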
[jira] [Created] (YARN-1836) Add retry cache support in ResourceManager
Tsuyoshi OZAWA created YARN-1836: Summary: Add retry cache support in ResourceManager Key: YARN-1836 URL: https://issues.apache.org/jira/browse/YARN-1836 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA HDFS-4942 supports RetryCache on the NN. This JIRA tracks RetryCache on the ResourceManager. If the RPCs are non-idempotent, we should use RetryCache to avoid returning incorrect failures to the client. YARN-1521 is a related JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
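For reference, the NN-side pattern introduced by HDFS-4942 looks roughly like the sketch below; applying it to an RM RPC is only an illustration here (retryCache is an assumed field, and doSubmitApplication is a hypothetical stand-in for a non-idempotent operation):
{code}
import java.io.IOException;
import org.apache.hadoop.ipc.RetryCache;
import org.apache.hadoop.ipc.RetryCache.CacheEntry;
import org.apache.hadoop.yarn.api.protocolrecords.SubmitApplicationRequest;

void submitWithRetryCache(SubmitApplicationRequest request) throws IOException {
  // If this call id is already cached, a retry arrived for a call that
  // previously completed; report the recorded outcome instead of
  // re-executing the operation.
  CacheEntry cacheEntry = RetryCache.waitForCompletion(retryCache);
  if (cacheEntry != null && cacheEntry.isSuccess()) {
    return;
  }
  boolean success = false;
  try {
    doSubmitApplication(request);  // hypothetical non-idempotent operation
    success = true;
  } finally {
    // Record the outcome so a later retry of the same call id sees it.
    RetryCache.setState(cacheEntry, success);
  }
}
{code}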
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935331#comment-13935331 ] Mit Desai commented on YARN-1833: - I am in the process of generating the patch. I will be uploading it soon. TestRMAdminService Fails in branch-2 Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Mit Desai In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1521) Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation
[ https://issues.apache.org/jira/browse/YARN-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935336#comment-13935336 ] Tsuyoshi OZAWA commented on YARN-1521: -- We should introduce RetryCache to avoid returning incorrect errors to the client if non-idempotent RPCs (e.g., submitApplication) are executed. Opened YARN-1836 for this. Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation - Key: YARN-1521 URL: https://issues.apache.org/jira/browse/YARN-1521 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-1028, we added automatic failover to RMProxy. This JIRA is to identify whether we need to add the idempotent annotation and which methods can be marked as idempotent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1833) TestRMAdminService Fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-1833: Attachment: YARN-1833.patch Attaching the patch for trunk and branch-2 TestRMAdminService Fails in branch-2 Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935374#comment-13935374 ] Akira AJISAKA commented on YARN-1833: - +1 TestRMAdminService Fails in branch-2 Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1833: Hadoop Flags: Reviewed Summary: TestRMAdminService Fails in trunk and branch-2 (was: TestRMAdminService Fails in branch-2) TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935409#comment-13935409 ] Billie Rinaldi commented on YARN-1717: -- In testing writing and age-off at the same time, the deletion thread did not seem to adversely affect the write rate. With a single writer, I saw about 450 single-entity puts per second, which is comparable to what I had observed previously. I configured the deletion thread to age data off after 90 seconds, and also set the deletion cycle interval to 90 seconds. It was able to age off data at around 4500 entities per second. With these settings, it typically aged off on the order of 36,000 entities per cycle in less than 8 seconds. Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935428#comment-13935428 ] Jian He commented on YARN-1795: --- [~rkanter], {code} 2014-03-06 19:01:24,731 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1394161202967_0004_01_04 taskAttempt attempt_1394161202967_0004_m_01_0 2014-03-06 19:01:24,733 INFO [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1394161202967_0004_m_00_0 2014-03-06 19:01:24,733 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1394161202967_0004_m_01_0 2014-03-06 19:01:24,734 INFO [ContainerLauncher #0] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: AAA numTokens = 1 NMToken :: 172.16.1.64:52707 :: 172.16.1.64:52707 2014-03-06 19:01:24,734 INFO [ContainerLauncher #0] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : 172.16.1.64:52707 2014-03-06 19:01:24,748 INFO [ContainerLauncher #1] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: AAA numTokens = 1 NMToken :: 172.16.1.64:52707 :: 172.16.1.64:52707 {code} How are you printing this log? Why are two duplicate NMTokens printed even though numTokens == 1? After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Blocker Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs: {noformat} 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} I did some debugging and found that the NMTokenCache has a different port number than what's being looked up.
For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? Update: This also happens in an actual cluster, not just Oozie's unit tests -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1799) Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff
[ https://issues.apache.org/jira/browse/YARN-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935431#comment-13935431 ] Karthik Kambatla commented on YARN-1799: bq. Given the disk write speed as a configuration (based on disk type, rpm etc), these factors can be derived. And allotted space for a task can also be considered. Sounds reasonable. Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff - Key: YARN-1799 URL: https://issues.apache.org/jira/browse/YARN-1799 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Sunil G LocalDirAllocator provides paths to all tasks for their local writes. It considers the good list of directories selected by the health-check mechanism in LocalDirsHandlerService. getLocalPathForWrite() considers whether the capacity of the last-accessed directory can meet the requested size. If more tasks ask LocalDirAllocator for paths, the allocation is done based on the disk availability at that given time. But the same path may have been given earlier to other tasks that are still writing sequentially. It is better to check against an upper cutoff for disk utilization. -- This message was sent by Atlassian JIRA (v6.2#6252)
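As a hypothetical illustration of the proposed cutoff (not the actual LocalDirAllocator API; the method name and the cutoff parameter are made up):
{code}
import java.io.File;

// Reject a directory when the write about to be placed there would push
// its utilization past a configured ceiling, instead of only checking
// the free space observed at allocation time.
static boolean hasCapacity(File dir, long requestedBytes, float maxUtilization) {
  long total = dir.getTotalSpace();
  long used = total - dir.getUsableSpace();
  float projected = (float) (used + requestedBytes) / total;
  return projected <= maxUtilization;
}
{code}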
[jira] [Commented] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935437#comment-13935437 ] Robert Kanter commented on YARN-1795: - Sorry, I didn't explain more specifically what I had printed out. Each line is for a token, in this format: {{NMToken :: key :: service}}, where {{key}} is the key from the hash map in NMTokenCache and {{service}} is the service in the token. Those end up being the same, so it's only printing one token in that snippet. After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Blocker Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs: {noformat} 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? Update: This also happens in an actual cluster, not just Oozie's unit tests -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1795: -- Assignee: Karthik Kambatla After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Assignee: Karthik Kambatla Priority: Blocker Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs: {noformat} 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? Update: This also happens in an actual cluster, not just Oozie's unit tests -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935439#comment-13935439 ] Karthik Kambatla commented on YARN-1795: Taking this up to investigate. After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Assignee: Karthik Kambatla Priority: Blocker Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs: {noformat} 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? Update: This also happens in an actual cluster, not just Oozie's unit tests -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails
Tsuyoshi OZAWA created YARN-1837: Summary: TestMoveApplication.testMoveRejectedByScheduler randomly fails Key: YARN-1837 URL: https://issues.apache.org/jira/browse/YARN-1837 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: Tsuyoshi OZAWA TestMoveApplication#testMoveRejectedByScheduler fails because of a NullPointerException. It looks like it is caused by an unhandled exception on the server side. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails
[ https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935448#comment-13935448 ] Tsuyoshi OZAWA commented on YARN-1837: -- Terminal log: {code} $ mvn test ... Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.243 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication testMoveRejectedByScheduler(org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication) Time elapsed: 0.36 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication.testMoveRejectedByScheduler(TestMoveApplication.java:83) {code} TestMoveApplication-output.txt: {code} 2014-03-14 18:43:31,582 ERROR [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(634)) - Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APP_ACCEPTED at NEW_SAVING at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:632) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:685) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:669) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} TestMoveApplication.testMoveRejectedByScheduler randomly fails -- Key: YARN-1837 URL: https://issues.apache.org/jira/browse/YARN-1837 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: Tsuyoshi OZAWA TestMoveApplication#testMoveRejectedByScheduler fails because of a NullPointerException. It looks like it is caused by an unhandled exception on the server side. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935450#comment-13935450 ] Hadoop QA commented on YARN-1833: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634771/YARN-1833.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3368//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3368//console This message is automatically generated. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.7.patch Updated a patch to pass tests. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk
[ https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935502#comment-13935502 ] Mit Desai commented on YARN-1591: - Hey, I have done a little investigation on the test.
{code}
static {
  // Mini-cluster mode makes DefaultMetricsSystem tolerate re-registering
  // an existing metrics source instead of throwing.
  DefaultMetricsSystem.setMiniClusterMode(true);
}
{code}
Setting mini-cluster mode makes the metrics system ignore the metrics source that is already registered in the unit test. This change seems to work on my local machine. What do you guys think? TestResourceTrackerService fails randomly on trunk -- Key: YARN-1591 URL: https://issues.apache.org/jira/browse/YARN-1591 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Attachments: YARN-1591.1.patch, YARN-1591.2.patch, YARN-1591.3.patch, YARN-1591.3.patch, YARN-1591.5.patch, YARN-1591.6.patch As evidenced by Jenkins at https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621. It's failing randomly on trunk on my local box too -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935505#comment-13935505 ] Mit Desai commented on YARN-1833: - Thanks Akira. I also verified that it is not related to my patch. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1811) RM HA: AM link broken if the AM is on nodes other than RM
[ https://issues.apache.org/jira/browse/YARN-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935528#comment-13935528 ] Robert Kanter commented on YARN-1811: - {quote}If we still do the redirection, where you concatenate RM-IDs, you should use RMHAUtils.{quote} Actually, [~vinodkv], what do you mean by this? It's already using RMHAUtils to find the active RM. RM HA: AM link broken if the AM is on nodes other than RM - Key: YARN-1811 URL: https://issues.apache.org/jira/browse/YARN-1811 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-1811.patch, YARN-1811.patch, YARN-1811.patch, YARN-1811.patch When using RM HA, if you click on the Application Master link in the RM web UI while the job is running, you get an Error 500: -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935546#comment-13935546 ] Jonathan Eagles commented on YARN-1833: --- [~mdesai], instead of removing the check, could we investigate using UserGroupInformation.createUserForTesting? This would remove the developer's environment as a factor in the correctness of the test. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} Because the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails when the sizes of groupWithInit and groupBefore happen to be the same. I do not think we need this assert here. Moreover, we already check that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should be harmless. -- This message was sent by Atlassian JIRA (v6.2#6252)
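A minimal sketch of that suggestion (the group names here are made up): UserGroupInformation.createUserForTesting pins the user-to-groups mapping inside the test, so the assertions no longer depend on whatever groups the developer's machine reports:
{code}
import org.apache.hadoop.security.UserGroupInformation;

// Create a test user whose groups are fixed by the test itself.
UserGroupInformation ugi = UserGroupInformation.createUserForTesting(
    "testuser", new String[] {"group_a", "group_b"});

// ugi.getGroupNames() now returns {"group_a", "group_b"} on any machine,
// so size and containment assertions have a stable baseline.
{code}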
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935559#comment-13935559 ] Hadoop QA commented on YARN-1474: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634796/YARN-1474.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3369//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3369//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3369//console This message is automatically generated. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935568#comment-13935568 ] Zhijie Shen commented on YARN-1717: --- Billie, thanks for your metrics. I've done some simple calculations myself. In the long term, if a cluster has x entities written per second, then no matter how long the ttl is, the number of entities to delete per second should be x on average. Therefore, if the throughput of put requests is 100 entities/sec, the number of entities to delete per second will be 100 as well. Given that we do the deletion every 5 minutes, we have 30,000 entities to delete per round. According to your measurement, it will take less than 8 sec to complete the deletion. The deletion will delay put requests, but it only happens for 8 secs out of every 5 mins, i.e., 2.67% of the time. That sounds good to me. +1 for the patch. Will commit it. Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
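The arithmetic in the comment above can be checked directly; a small sketch, using the 100 entities/sec put rate and 8-second deletion time quoted in the comment:
{code}
public class DeletionOverhead {
  public static void main(String[] args) {
    long putRatePerSec = 100;        // put throughput quoted above
    long intervalSecs = 5 * 60;      // deletion runs every 5 minutes
    long perRound = putRatePerSec * intervalSecs;  // 30,000 entities/round
    double deletionSecs = 8.0;       // measured upper bound per round
    double overheadPct = 100.0 * deletionSecs / intervalSecs;
    System.out.printf("%d entities per round, deleting %.2f%% of the time%n",
        perRound, overheadPct);      // prints 30000 and 2.67
  }
}
{code}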
[jira] [Updated] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1717: -- Hadoop Flags: Reviewed Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1811) RM HA: AM link broken if the AM is on nodes other than RM
[ https://issues.apache.org/jira/browse/YARN-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-1811: Attachment: YARN-1811.patch Updated patch based on Vinod's comments RM HA: AM link broken if the AM is on nodes other than RM - Key: YARN-1811 URL: https://issues.apache.org/jira/browse/YARN-1811 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-1811.patch, YARN-1811.patch, YARN-1811.patch, YARN-1811.patch, YARN-1811.patch When using RM HA, if you click on the Application Master link in the RM web UI while the job is running, you get an Error 500: -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-1833: Attachment: YARN-1833-v2.patch Thanks [~jeagles] for the suggestion. I had not thought of that solution. Attaching the new patch, which uses a dummy user for the test and no longer removes the assert. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails as the size of both groupWithInit and groupBefore are same. I do not think we need to have this assert here. Moreover we are also checking that the groupInit does not have the userGroups that are in the groupBefore so removing the assert may not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935594#comment-13935594 ] Alejandro Abdelnur commented on YARN-796: - Scheduler configurations are refreshed dynamically; if the list of valid labels lives there, it could be refreshed as well. I would prefer to detect and reject typos, from a user-experience and troubleshooting point of view. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Arun C Murthy It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store
[ https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935618#comment-13935618 ] Hudson commented on YARN-1717: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5331 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5331/]) YARN-1717. Enabled periodically discarding old data in LeveldbTimelineStore. Contributed by Billie Rinaldi. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1577693) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/GenericObjectMapper.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline/TestLeveldbTimelineStore.java Enable offline deletion of entries in leveldb timeline store Key: YARN-1717 URL: https://issues.apache.org/jira/browse/YARN-1717 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Fix For: 2.4.0 Attachments: YARN-1717.1.patch, YARN-1717.10.patch, YARN-1717.11.patch, YARN-1717.2.patch, YARN-1717.3.patch, YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch The leveldb timeline store implementation needs the following: * better documentation of its internal structures * internal changes to enable deleting entities ** never overwrite existing primary filter entries ** add hidden reverse pointers to related entities -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935625#comment-13935625 ] Chen He commented on YARN-1833: --- +1, the YARN-1833-v2.patch works and the unit test passed. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails as the size of both groupWithInit and groupBefore are same. I do not think we need to have this assert here. Moreover we are also checking that the groupInit does not have the userGroups that are in the groupBefore so removing the assert may not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1551) Allow user-specified reason for killApplication
[ https://issues.apache.org/jira/browse/YARN-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1551: Target Version/s: 2.4.0 Affects Version/s: (was: 2.4.0) 2.3.0 Allow user-specified reason for killApplication --- Key: YARN-1551 URL: https://issues.apache.org/jira/browse/YARN-1551 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1551.v01.patch, YARN-1551.v02.patch, YARN-1551.v03.patch, YARN-1551.v04.patch, YARN-1551.v05.patch, YARN-1551.v06.patch, YARN-1551.v06.patch This completes MAPREDUCE-5648 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1206) Container logs link is broken on RM web UI after application finished
[ https://issues.apache.org/jira/browse/YARN-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935658#comment-13935658 ] Jian He commented on YARN-1206: --- Thanks for the patch! LGTM. Can you add a comment saying why we should not check container == null? Container logs link is broken on RM web UI after application finished - Key: YARN-1206 URL: https://issues.apache.org/jira/browse/YARN-1206 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Priority: Blocker Attachments: YARN-1206.patch With log aggregation disabled, when container is running, its logs link works properly, but after the application is finished, the link shows 'Container does not exist.' -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935673#comment-13935673 ] Hadoop QA commented on YARN-1833: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634829/YARN-1833-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3370//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3370//console This message is automatically generated. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails as the size of both groupWithInit and groupBefore are same. I do not think we need to have this assert here. Moreover we are also checking that the groupInit does not have the userGroups that are in the groupBefore so removing the assert may not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935689#comment-13935689 ] Tsuyoshi OZAWA commented on YARN-1474: -- [~kkambatl], [~sandyr], [~vinodkv], The latest patch now passes tests. The remaining test failure is filed as YARN-1591 and is unrelated to this JIRA. I'd appreciate it if you could take a look at the patch. If you have additional comments or a better approach, please let me know. Thanks! Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
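For readers unfamiliar with the YARN service model the issue summary refers to: services extend org.apache.hadoop.service.AbstractService and override explicit lifecycle hooks. A rough sketch of the shape the patch moves schedulers toward (illustrative only, not code from the patch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

// Illustrative: a scheduler fitted into the service lifecycle gets explicit
// init/start/stop hooks instead of a single reinitialize() entry point.
public class ExampleSchedulerService extends AbstractService {
  public ExampleSchedulerService() {
    super(ExampleSchedulerService.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // read queue/scheduler configuration here
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // start background threads (e.g., an update/monitoring thread)
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // stop threads and release resources
    super.serviceStop();
  }
}
{code}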
[jira] [Created] (YARN-1838) ATS entities api should provide ability to get entities from given id
Srimanth Gunturi created YARN-1838: -- Summary: ATS entities api should provide ability to get entities from given id Key: YARN-1838 URL: https://issues.apache.org/jira/browse/YARN-1838 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Srimanth Gunturi To support pagination, we need ability to get entities from a certain ID by providing a new param called {{fromid}}. For example on a page of 10 jobs, our first call will be like [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&limit=11] When user hits next, we would like to call [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&fromid=JID11&limit=11] and continue on for further _Next_ clicks On hitting back, we will make similar calls for previous items [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&fromid=JID1&limit=11] {{fromid}} should be inclusive of the id given. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1515: Target Version/s: 2.4.0 Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: New Feature Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1838) Timeline service getEntities API should provide ability to get entities from given id
[ https://issues.apache.org/jira/browse/YARN-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1838: -- Component/s: (was: api) Assignee: Billie Rinaldi Summary: Timeline service getEntities API should provide ability to get entities from given id (was: ATS entities api should provide ability to get entities from given id) Timeline service getEntities API should provide ability to get entities from given id - Key: YARN-1838 URL: https://issues.apache.org/jira/browse/YARN-1838 Project: Hadoop YARN Issue Type: Sub-task Reporter: Srimanth Gunturi Assignee: Billie Rinaldi To support pagination, we need ability to get entities from a certain ID by providing a new param called {{fromid}}. For example on a page of 10 jobs, our first call will be like [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&limit=11] When user hits next, we would like to call [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&fromid=JID11&limit=11] and continue on for further _Next_ clicks On hitting back, we will make similar calls for previous items [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&fromid=JID1&limit=11] {{fromid}} should be inclusive of the id given. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935704#comment-13935704 ] Jonathan Eagles commented on YARN-1833: --- +1. YARN-1830 causes the TestRMRestart error. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails as the size of both groupWithInit and groupBefore are same. I do not think we need to have this assert here. Moreover we are also checking that the groupInit does not have the userGroups that are in the groupBefore so removing the assert may not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1838) Timeline service getEntities API should provide ability to get entities from given id
[ https://issues.apache.org/jira/browse/YARN-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-1838: - Attachment: YARN-1838.1.patch The attached patch implements the fromId parameter. I used camel case for fromId to match the other query parameters (primaryFilter, windowEnd, etc.). Timeline service getEntities API should provide ability to get entities from given id - Key: YARN-1838 URL: https://issues.apache.org/jira/browse/YARN-1838 Project: Hadoop YARN Issue Type: Sub-task Reporter: Srimanth Gunturi Assignee: Billie Rinaldi Attachments: YARN-1838.1.patch To support pagination, we need ability to get entities from a certain ID by providing a new param called {{fromid}}. For example on a page of 10 jobs, our first call will be like [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&limit=11] When user hits next, we would like to call [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&fromid=JID11&limit=11] and continue on for further _Next_ clicks On hitting back, we will make similar calls for previous items [http://server:8188/ws/v1/timeline/HIVE_QUERY_ID?fields=events,primaryfilters,otherinfo&fromid=JID1&limit=11] {{fromid}} should be inclusive of the id given. -- This message was sent by Atlassian JIRA (v6.2#6252)
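A hypothetical client-side use of the new parameter, assuming the endpoint from the issue description and the camel-cased fromId name from the patch comment above (the host, entity type, and IDs are illustrative):
{code}
public class TimelinePaging {
  public static void main(String[] args) {
    // Page through timeline entities 10 at a time: request limit=11 and use
    // the 11th entity's id as the (inclusive) fromId of the next page.
    String base = "http://server:8188/ws/v1/timeline/HIVE_QUERY_ID"
        + "?fields=events,primaryfilters,otherinfo";
    String firstPage = base + "&limit=11";
    // suppose the 11th entity returned on the first page has id JID11
    String nextPage = base + "&fromId=JID11&limit=11"; // fromId is inclusive
    System.out.println(firstPage);
    System.out.println(nextPage);
  }
}
{code}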
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935716#comment-13935716 ] Akira AJISAKA commented on YARN-1833: - Thanks [~jeagles] and [~mitdesai] for the improvement. +1 for the v2 patch. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails as the size of both groupWithInit and groupBefore are same. I do not think we need to have this assert here. Moreover we are also checking that the groupInit does not have the userGroups that are in the groupBefore so removing the assert may not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935729#comment-13935729 ] Hudson commented on YARN-1833: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5333 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5333/]) YARN-1833. TestRMAdminService Fails in trunk and branch-2 (Mit Desais via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1577737) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed. {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails as the size of both groupWithInit and groupBefore are same. I do not think we need to have this assert here. Moreover we are also checking that the groupInit does not have the userGroups that are in the groupBefore so removing the assert may not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1690) sending ATS events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1690: Attachment: YARN-1690-5.patch Fixing the findbugs warning. Thanks, Mayank sending ATS events from Distributed shell -- Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1599) webUI rm.webapp.AppBlock should redirect to a history App page if and when available
[ https://issues.apache.org/jira/browse/YARN-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935742#comment-13935742 ] Gera Shegalov commented on YARN-1599: - [~jlowe] thanks for pointing in the right direction. E.g., setting yarn.log.server.url to {code}http://${mapreduce.jobhistory.webapp.address}/jobhistory/logs{code} solves the problem on the pseudo-distributed cluster. webUI rm.webapp.AppBlock should redirect to a history App page if and when available Key: YARN-1599 URL: https://issues.apache.org/jira/browse/YARN-1599 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha, 2.2.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: Screen Shot 2014-01-16 at 6.52.17 PM.png, Screen Shot 2014-01-16 at 7.30.32 PM.png, YARN-1599.v01.patch, YARN-1599.v02.patch, YARN-1599.v03.patch When the log aggregation is enabled, and the application finishes, our users think that the AppMaster logs were lost because the link to the AM attempt logs are not updated and result in HTTP 404. Only tracking URL is updated. In order to have a smoother user experience, we propose to simply redirect to the new tracking URL when the page with invalid log links is accessed. -- This message was sent by Atlassian JIRA (v6.2#6252)
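As a concrete illustration of that setting: Hadoop's Configuration expands ${...} references against other properties, so the history server address is resolved when the value is read. A sketch (the host and port are illustrative; 19888 is the default job history web port):
{code}
import org.apache.hadoop.conf.Configuration;

public class LogServerUrlExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("mapreduce.jobhistory.webapp.address", "localhost:19888");
    conf.set("yarn.log.server.url",
        "http://${mapreduce.jobhistory.webapp.address}/jobhistory/logs");
    // get() performs variable expansion and prints
    // http://localhost:19888/jobhistory/logs
    System.out.println(conf.get("yarn.log.server.url"));
  }
}
{code}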
[jira] [Commented] (YARN-1690) sending ATS events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935768#comment-13935768 ] Hadoop QA commented on YARN-1690: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12634854/YARN-1690-5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3371//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3371//console This message is automatically generated. sending ATS events from Distributed shell -- Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk
[ https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935794#comment-13935794 ] Tsuyoshi OZAWA commented on YARN-1591: -- Hi [~mitdesai], thank you for joining this JIRA! In fact, the approach you suggested is essentially the same as YARN-1591.1.patch. It's insufficient to deal with the intermittent test failure, because sometimes another problem can occur: an unhandled YarnRuntimeException from AsyncDispatcher. The log at the time is as follows: {code} $ for i in `seq 1 100`; do mvn test -Dtest=TestResourceTrackerService | grep FAILURE; done ... sometimes occurs failure and output file is as follows... 2014-03-14 22:59:31,468 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(180)) - Error in dispatcher thread org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher.handle(ResourceManager.java:633) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher.handle(ResourceManager.java:539) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher.handle(ResourceManager.java:631) ... 4 more {code} The first patch is available at: https://issues.apache.org/jira/secure/attachment/12633362/YARN-1591.1.patch TestResourceTrackerService fails randomly on trunk -- Key: YARN-1591 URL: https://issues.apache.org/jira/browse/YARN-1591 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Attachments: YARN-1591.1.patch, YARN-1591.2.patch, YARN-1591.3.patch, YARN-1591.3.patch, YARN-1591.5.patch, YARN-1591.6.patch As evidenced by Jenkins at https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621. It's failing randomly on trunk on my local box too -- This message was sent by Atlassian JIRA (v6.2#6252)
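The failure mode in the log above is a dispatcher thread, interrupted while blocked on its event queue during test teardown, wrapping the InterruptedException in a YarnRuntimeException and dying. One possible shape of a fix, sketched here with illustrative names rather than the actual AsyncDispatcher code, is to treat the interrupt as expected once the service is stopping:
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch, not the committed patch: only treat an interrupt as
// fatal while the dispatcher is still supposed to be running.
public class StopAwareHandler {
  private final BlockingQueue<Object> eventQueue =
      new LinkedBlockingQueue<Object>();
  private volatile boolean stopped = false;

  public void handle(Object event) {
    try {
      eventQueue.put(event);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve interrupt status
      if (!stopped) {
        throw new RuntimeException("Interrupted while dispatching", e);
      }
      // during shutdown the interrupt is expected; drop the event quietly
    }
  }

  public void stop() {
    stopped = true;
  }
}
{code}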
[jira] [Commented] (YARN-1536) Cleanup: Get rid of ResourceManager#get*SecretManager() methods and use the RMContext methods instead
[ https://issues.apache.org/jira/browse/YARN-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935803#comment-13935803 ] Anubhav Dhoot commented on YARN-1536: - The test failures are unrelated. The change only replaced calls to each function with its inline expansion and then removed the function. Cleanup: Get rid of ResourceManager#get*SecretManager() methods and use the RMContext methods instead - Key: YARN-1536 URL: https://issues.apache.org/jira/browse/YARN-1536 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Anubhav Dhoot Priority: Minor Labels: newbie Attachments: yarn-1536.patch Both ResourceManager and RMContext have methods to access the secret managers, and it should be safe (cleaner) to get rid of the ResourceManager methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
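The pattern of the cleanup, sketched for one of the secret managers (an illustrative call site; the actual patch touches several accessors):
{code}
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager;

// Illustrative shape of the change: a call site that previously asked the
// ResourceManager for a secret manager now asks the RMContext directly,
// which is what the removed ResourceManager accessors delegated to anyway.
public class CleanupExample {
  static RMContainerTokenSecretManager tokenManager(RMContext rmContext) {
    // before: rm.getRMContainerTokenSecretManager()
    return rmContext.getContainerTokenSecretManager();
  }
}
{code}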
[jira] [Commented] (YARN-1536) Cleanup: Get rid of ResourceManager#get*SecretManager() methods and use the RMContext methods instead
[ https://issues.apache.org/jira/browse/YARN-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935811#comment-13935811 ] Tsuyoshi OZAWA commented on YARN-1536: -- +1, LGTM. Confirmed that the tests pass locally. TestRMRestart's failure has already been filed as YARN-1830. [~kkambatl], can you also check it? Cleanup: Get rid of ResourceManager#get*SecretManager() methods and use the RMContext methods instead - Key: YARN-1536 URL: https://issues.apache.org/jira/browse/YARN-1536 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Anubhav Dhoot Priority: Minor Labels: newbie Attachments: yarn-1536.patch Both ResourceManager and RMContext have methods to access the secret managers, and it should be safe (cleaner) to get rid of the ResourceManager methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1536) Cleanup: Get rid of ResourceManager#get*SecretManager() methods and use the RMContext methods instead
[ https://issues.apache.org/jira/browse/YARN-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1536: - Hadoop Flags: Reviewed Cleanup: Get rid of ResourceManager#get*SecretManager() methods and use the RMContext methods instead - Key: YARN-1536 URL: https://issues.apache.org/jira/browse/YARN-1536 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Anubhav Dhoot Priority: Minor Labels: newbie Attachments: yarn-1536.patch Both ResourceManager and RMContext have methods to access the secret managers, and it should be safe (cleaner) to get rid of the ResourceManager methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1521) Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation
[ https://issues.apache.org/jira/browse/YARN-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935821#comment-13935821 ] Xuan Gong commented on YARN-1521: - First of all, we need to find which APIs can be marked as idempotent. Here is the list of APIs that I think we can mark as idempotent: * ResourceTracker ** registerNodeManager ** nodeHeartbeat * ResourceManagerAdministrationProtocol ** refreshQueues ** refreshNodes ** refreshSuperUserGroupsConfiguration ** refreshUserToGroupsMappings ** refreshAdminAcls ** refreshServiceAcls * ApplicationClientProtocol ** forceKillApplication ** getApplicationReport (already marked) ** getClusterMetrics ** getApplications ** getClusterNodes ** getQueueInfo ** getQueueUserAcls ** getApplicationAttemptReport ** getApplicationAttempts ** getContainerReport ** getContainers Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation - Key: YARN-1521 URL: https://issues.apache.org/jira/browse/YARN-1521 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-1028, we add the automatically failover into RMProxy. This JIRA is to identify whether we need to add idempotent annotation and which methods can be marked as idempotent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1839) Capacity scheduler preempts an AM out. AM attempt 2 fails to launch task container with SecretManager$InvalidToken: No NMToken sent
Tassapol Athiapinya created YARN-1839: - Summary: Capacity scheduler preempts an AM out. AM attempt 2 fails to launch task container with SecretManager$InvalidToken: No NMToken sent Key: YARN-1839 URL: https://issues.apache.org/jira/browse/YARN-1839 Project: Hadoop YARN Issue Type: Bug Components: applications, capacityscheduler Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Priority: Critical Use single-node cluster. Turn on capacity scheduler preemption. Run MR sleep job as app 1. Take entire cluster. Run MR sleep job as app 2. Preempt app1 out. Wait till app 2 finishes. App 1 AM attempt 2 will start. It won't be able to launch a task container with this error stack trace in AM logs: {code} 2014-03-13 20:13:50,254 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394741557066_0001_m_00_1009: Container launch failed for container_1394741557066_0001_02_21 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for host:45454 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1839) Capacity scheduler preempts an AM out. AM attempt 2 fails to launch task container with SecretManager$InvalidToken: No NMToken sent
[ https://issues.apache.org/jira/browse/YARN-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-1839: - Assignee: Jian He Capacity scheduler preempts an AM out. AM attempt 2 fails to launch task container with SecretManager$InvalidToken: No NMToken sent --- Key: YARN-1839 URL: https://issues.apache.org/jira/browse/YARN-1839 Project: Hadoop YARN Issue Type: Bug Components: applications, capacityscheduler Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Jian He Priority: Critical Use single-node cluster. Turn on capacity scheduler preemption. Run MR sleep job as app 1. Take entire cluster. Run MR sleep job as app 2. Preempt app1 out. Wait till app 2 finishes. App 1 AM attempt 2 will start. It won't be able to launch a task container with this error stack trace in AM logs: {code} 2014-03-13 20:13:50,254 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394741557066_0001_m_00_1009: Container launch failed for container_1394741557066_0001_02_21 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for host:45454 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1839) Capacity scheduler preempts an AM out. AM attempt 2 fails to launch task container with SecretManager$InvalidToken: No NMToken sent
[ https://issues.apache.org/jira/browse/YARN-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935849#comment-13935849 ] Jian He commented on YARN-1839: --- {code} // The node set here is used for differentiating whether the NMToken // has been issued for this node from the client's perspective. If // this is an AM container, the NMToken is issued only for RM and so // we should not update the node set. if (container.getId().getId() != 1) { nodeSet.add(container.getNodeId()); {code} This piece of code is flawed. We cannot assume the AM container id is always equal to 1. If the AM container id doesn't equal one and it is added into the node set, the RM will think the NMToken has already been sent and won't send it for the other, normal containers that the AM asks for. Capacity scheduler preempts an AM out. AM attempt 2 fails to launch task container with SecretManager$InvalidToken: No NMToken sent --- Key: YARN-1839 URL: https://issues.apache.org/jira/browse/YARN-1839 Project: Hadoop YARN Issue Type: Bug Components: applications, capacityscheduler Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Jian He Priority: Critical Use single-node cluster. Turn on capacity scheduler preemption. Run MR sleep job as app 1. Take entire cluster. Run MR sleep job as app 2. Preempt app1 out. Wait till app 2 finishes. App 1 AM attempt 2 will start. It won't be able to launch a task container with this error stack trace in AM logs: {code} 2014-03-13 20:13:50,254 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394741557066_0001_m_00_1009: Container launch failed for container_1394741557066_0001_02_21 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for host:45454 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
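A sketch of the more robust check implied by the comment above: compare against the attempt's recorded master container rather than assuming the AM is always container 1. This is illustrative only, not taken from a patch; the method wrapper and null handling are assumptions:
{code}
import java.util.Set;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;

// Illustrative sketch of the fix direction: look up the attempt's actual
// AM container id instead of testing container.getId().getId() != 1.
public class NMTokenNodeSetExample {
  static void recordNode(RMContext rmContext, Container container,
      Set<NodeId> nodeSet) {
    ContainerId amId = rmContext.getRMApps()
        .get(container.getId().getApplicationAttemptId().getApplicationId())
        .getCurrentAppAttempt()
        .getMasterContainer()
        .getId();
    if (!container.getId().equals(amId)) {
      // Only non-AM containers mark the node as having been sent an NMToken.
      nodeSet.add(container.getNodeId());
    }
  }
}
{code}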
[jira] [Commented] (YARN-1795) After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935850#comment-13935850 ] Jian He commented on YARN-1795: --- Hi Karthik, thanks for taking it up. YARN-1839 filed, but I'm not sure whether this jira is related to that. After YARN-713, using FairScheduler can cause an InvalidToken Exception for NMTokens Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Assignee: Karthik Kambatla Priority: Blocker Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs: {noformat} 2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759 at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? Update: This also happens in an actual cluster, not just Oozie's unit tests -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1521) Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation
[ https://issues.apache.org/jira/browse/YARN-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935873#comment-13935873 ] Xuan Gong commented on YARN-1521: - APIs that are not in the previous list: * ApplicationMasterProtocol ** registerApplicationMaster ** finishApplicationMaster ** allocate * ResourceManagerAdministrationProtocol ** updateNodeResource * ApplicationClientProtocol ** getNewApplication ** submitApplication ** getDelegationToken ** renewDelegationToken ** cancelDelegationToken ** moveApplicationAcrossQueues Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation - Key: YARN-1521 URL: https://issues.apache.org/jira/browse/YARN-1521 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-1028, we add the automatically failover into RMProxy. This JIRA is to identify whether we need to add idempotent annotation and which methods can be marked as idempotent. -- This message was sent by Atlassian JIRA (v6.2#6252)
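For reference, the marking itself is done with Hadoop's retry annotations from org.apache.hadoop.io.retry. A sketch on a made-up interface (the real annotations go on the protocol interfaces listed in the two comments above, whose methods take request/response objects rather than these simplified signatures):
{code}
import java.io.IOException;
import org.apache.hadoop.io.retry.AtMostOnce;
import org.apache.hadoop.io.retry.Idempotent;

// Illustrative interface: @Idempotent methods may be safely retried by a
// failover-aware proxy; @AtMostOnce methods must not be blindly retried.
public interface ExampleProtocol {
  @Idempotent
  String getQueueInfo(String queueName) throws IOException;

  @AtMostOnce
  void submitApplication(String applicationId) throws IOException;
}
{code}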