[jira] [Resolved] (YARN-1554) add an env variable for the YARN AM classpath
[ https://issues.apache.org/jira/browse/YARN-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved YARN-1554. -- Resolution: Duplicate add an env variable for the YARN AM classpath - Key: YARN-1554 URL: https://issues.apache.org/jira/browse/YARN-1554 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Steve Loughran Priority: Minor Currently YARN apps set up their classpath via the default value {{YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH}} or an overridden property {{yarn.application.classpath}}. If you don't have the classpath right, the AM won't start up. This means the client needs to be explicitly configured with the CP. If the node manager exported the classpath property via an env variable {{YARN_APPLICATION_CLASSPATH}}, then the classpath could be set up in the AM simply by referencing that property, rather than hoping its setting is in sync. -- This message was sent by Atlassian JIRA (v6.2#6252)
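A minimal sketch of how an AM could consume such a variable if the NodeManager exported it (illustrative only; the YARN_APPLICATION_CLASSPATH name comes from the issue text, while the class and method names are hypothetical):
{code:title=AmClasspathSketch.java|borderStyle=solid}
import java.io.File;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmClasspathSketch {
  /** Prefer the (proposed) NM-exported variable; fall back to client-side config. */
  public static String buildClasspath(Configuration conf, Map<String, String> env) {
    // Hypothetical variable exported by the NodeManager, as proposed above.
    String exported = env.get("YARN_APPLICATION_CLASSPATH");
    if (exported != null && !exported.isEmpty()) {
      return exported;
    }
    // Today's behaviour: rely on yarn.application.classpath being in sync on the client.
    StringBuilder cp = new StringBuilder();
    for (String entry : conf.getStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH)) {
      if (cp.length() > 0) {
        cp.append(File.pathSeparator);
      }
      cp.append(entry.trim());
    }
    return cp.toString();
  }
}
{code}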
[jira] [Commented] (YARN-1907) TestRMApplicationHistoryWriter#testRMWritingMassiveHistory runs slow and intermittently fails
[ https://issues.apache.org/jira/browse/YARN-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965240#comment-13965240 ] Hudson commented on YARN-1907: -- FAILURE: Integrated in Hadoop-Yarn-trunk #535 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/535/]) YARN-1907. TestRMApplicationHistoryWriter#testRMWritingMassiveHistory intermittently fails. Contributed by Mit Desai. (kihwal: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1585992) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java TestRMApplicationHistoryWriter#testRMWritingMassiveHistory runs slow and intermittently fails - Key: YARN-1907 URL: https://issues.apache.org/jira/browse/YARN-1907 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 3.0.0, 2.5.0 Attachments: HDFS-6195.patch The test has 1 containers that it tries to cleanup. The cleanup has a timeout of 2ms in which the test sometimes cannot do the cleanup completely and gives out an Assertion Failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1910) TestAMRMTokens fails on windows
[ https://issues.apache.org/jira/browse/YARN-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965236#comment-13965236 ] Hudson commented on YARN-1910: -- FAILURE: Integrated in Hadoop-Yarn-trunk #535 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/535/]) YARN-1910. Fixed a race condition in TestAMRMTokens that causes the test to fail more often on Windows. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1586192) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java TestAMRMTokens fails on windows --- Key: YARN-1910 URL: https://issues.apache.org/jira/browse/YARN-1910 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.0 Attachments: YARN-1910.1.patch, YARN-1910.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1677) Potential bugs in exception handlers
[ https://issues.apache.org/jira/browse/YARN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965253#comment-13965253 ] Ding Yuan commented on YARN-1677: - Ping. Is there anything else I can help with from my side? Potential bugs in exception handlers Key: YARN-1677 URL: https://issues.apache.org/jira/browse/YARN-1677 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Ding Yuan Attachments: yarn-1677.patch Hi YARN developers, We are a group of researchers working on software reliability. We recently did a study and found that the majority of the most severe failures in Hadoop are caused by bugs in exception-handling logic, so we built a simple checking tool that automatically detects some bug patterns that have caused very severe failures. I am reporting some of the results for YARN here. Any feedback is much appreciated!
== Case 1: Line: 551, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
{noformat}
switch (monitoringEvent.getType()) {
case START_MONITORING_CONTAINER:
  ..
  ..
default:
  // TODO: Wrong event.
}
{noformat}
The switch fall-through (handling any potential unexpected event) is empty. Should we at least print an error message here?
==
== Case 2: Line: 491, File: org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
{noformat}
} catch (Throwable e) {
  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
}
{noformat}
The handler of this very general exception only logs the error. The TODO seems to indicate it is not sufficient.
==
== Case 3: Line: 861, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
for (LocalResourceStatus stat : remoteResourceStatuses) {
  LocalResource rsrc = stat.getResource();
  LocalResourceRequest req = null;
  try {
    req = new LocalResourceRequest(rsrc);
  } catch (URISyntaxException e) {
    // TODO fail? Already translated several times...
  }
The handler for URISyntaxException is empty, and the TODO seems to indicate it is not sufficient. The same code pattern can also be found at:
Line: 901, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
Line: 838, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
Line: 878, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
At line: 803, File: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, the handler of URISyntaxException also seems insufficient:
{noformat}
try {
  shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
      shellScriptPath)));
} catch (URISyntaxException e) {
  LOG.error("Error when trying to use shell script path specified"
      + " in env, path=" + shellScriptPath);
  e.printStackTrace();
  // A failure scenario on bad input such as invalid shell script path
  // We know we cannot continue launching the container
  // so we should release it.
  // TODO
  numCompletedContainers.incrementAndGet();
  numFailedContainers.incrementAndGet();
  return;
}
{noformat}
==
== Case 4: Line: 627, File: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
{noformat}
try {
  /* keep the master in sync with the state machine */
  this.stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
  LOG.error("Can't handle this event at current state", e);
  /* TODO fail the application on the failed transition */
}
{noformat}
The handler of this exception only logs the error. The TODO seems to indicate it is not sufficient. This exact same code pattern can also be found at:
Line: 573, File: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
==
== Case 5: empty handler for exception: java.lang.InterruptedException Line: 123, File: org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java
{noformat}
public void join() {
  if(proxyServer != null) {
    try
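A hedged sketch of the kind of hardening Cases 1 and 3 are asking for — log and surface the problem rather than silently dropping it. The class, enum and method names below are illustrative, not the actual NodeManager code.
{code:title=ExceptionHandlingSketch.java|borderStyle=solid}
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ExceptionHandlingSketch {
  private static final Log LOG = LogFactory.getLog(ExceptionHandlingSketch.class);

  enum MonitorEventType { START_MONITORING_CONTAINER, STOP_MONITORING_CONTAINER }

  static void handle(MonitorEventType type) {
    switch (type) {
      case START_MONITORING_CONTAINER:
        // ... existing handling ...
        break;
      default:
        // Case 1: at minimum, record the unexpected event instead of falling through silently.
        LOG.error("Unexpected ContainersMonitor event type: " + type);
        break;
    }
  }

  static URI parseOrNull(String raw) {
    try {
      return new URI(raw);
    } catch (URISyntaxException e) {
      // Case 3: log the bad input and let the caller skip the resource,
      // rather than leaving the catch block empty.
      LOG.error("Invalid resource URI: " + raw, e);
      return null;
    }
  }
}
{code}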
[jira] [Updated] (YARN-322) Add cpu information to queue metrics
[ https://issues.apache.org/jira/browse/YARN-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-322: --- Fix Version/s: (was: 2.4.0) 2.5.0 Add cpu information to queue metrics Key: YARN-322 URL: https://issues.apache.org/jira/browse/YARN-322 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, scheduler Reporter: Arun C Murthy Assignee: Arun C Murthy Fix For: 2.5.0 Post YARN-2 we need to add cpu information to queue metrics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1334) YARN should give more info on errors when running failed distributed shell command
[ https://issues.apache.org/jira/browse/YARN-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1334: Fix Version/s: (was: 2.4.0) 2.5.0 YARN should give more info on errors when running failed distributed shell command -- Key: YARN-1334 URL: https://issues.apache.org/jira/browse/YARN-1334 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.5.0 Attachments: YARN-1334.1.patch Run incorrect command such as: /usr/bin/yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar distributedshell jar -shell_command ./test1.sh -shell_script ./ would show shell exit code exception with no useful message. It should print out sysout/syserr of containers/AM of why it is failing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-650) User guide for preemption
[ https://issues.apache.org/jira/browse/YARN-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-650: --- Fix Version/s: (was: 2.4.0) 2.5.0 User guide for preemption - Key: YARN-650 URL: https://issues.apache.org/jira/browse/YARN-650 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Chris Douglas Priority: Minor Fix For: 2.5.0 Attachments: Y650-0.patch YARN-45 added a protocol for the RM to ask back resources. The docs on writing YARN applications should include a section on how to interpret this message. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1514: Fix Version/s: (was: 2.4.0) 2.5.0 Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster fails over, so its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1722) AMRMProtocol should have a way of getting all the nodes in the cluster
[ https://issues.apache.org/jira/browse/YARN-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1722: Fix Version/s: (was: 2.4.0) 2.5.0 AMRMProtocol should have a way of getting all the nodes in the cluster -- Key: YARN-1722 URL: https://issues.apache.org/jira/browse/YARN-1722 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Bikas Saha Fix For: 2.5.0 There is no way for an AM to find out the names of all the nodes in the cluster via the AMRMProtocol. An AM can at best ask for containers at the * location. The only way to get that information is via the ClientRMProtocol, but that is secured by Kerberos or an RMDelegationToken, while the AM has an AMRMToken. This is a pretty important piece of missing functionality. There are other jiras open about getting cluster topology etc., but they haven't been addressed, perhaps due to the lack of a clear definition of cluster topology. Adding a means to at least get the node information would be a good first step. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-153) PaaS on YARN: an YARN application to demonstrate that YARN can be used as a PaaS
[ https://issues.apache.org/jira/browse/YARN-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-153: --- Fix Version/s: (was: 2.4.0) 2.5.0 PaaS on YARN: an YARN application to demonstrate that YARN can be used as a PaaS Key: YARN-153 URL: https://issues.apache.org/jira/browse/YARN-153 Project: Hadoop YARN Issue Type: New Feature Reporter: Jacob Jaigak Song Assignee: Jacob Jaigak Song Fix For: 2.5.0 Attachments: HADOOPasPAAS_Architecture.pdf, MAPREDUCE-4393.patch, MAPREDUCE-4393.patch, MAPREDUCE-4393.patch, MAPREDUCE4393.patch, MAPREDUCE4393.patch Original Estimate: 336h Time Spent: 336h Remaining Estimate: 0h This application is to demonstrate that YARN can be used for non-mapreduce applications. As Hadoop has already been adopted and deployed widely and its deployment in future will be highly increased, we thought that it's a good potential to be used as PaaS. I have implemented a proof of concept to demonstrate that YARN can be used as a PaaS (Platform as a Service). I have done a gap analysis against VMware's Cloud Foundry and tried to achieve as many PaaS functionalities as possible on YARN. I'd like to check in this POC as a YARN example application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1234) Container localizer logs are not created in secured cluster
[ https://issues.apache.org/jira/browse/YARN-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1234: Fix Version/s: (was: 2.4.0) 2.5.0 Container localizer logs are not created in secured cluster Key: YARN-1234 URL: https://issues.apache.org/jira/browse/YARN-1234 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Fix For: 2.5.0 When we are running ContainerLocalizer in secured cluster we potentially are not creating any log file to track log messages. This will be helpful in potentially identifying ContainerLocalization issues in secured cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-314) Schedulers should allow resource requests of different sizes at the same priority and location
[ https://issues.apache.org/jira/browse/YARN-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-314: --- Fix Version/s: (was: 2.4.0) 2.5.0 Schedulers should allow resource requests of different sizes at the same priority and location -- Key: YARN-314 URL: https://issues.apache.org/jira/browse/YARN-314 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.5.0 Currently, resource requests for the same container and locality are expected to all be the same size. While it doesn't look like this is needed for apps currently, and it can be circumvented by specifying different priorities if absolutely necessary, it seems to me that the ability to request containers with different resource requirements at the same priority level should be there for the future and for completeness' sake. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-113) WebAppProxyServlet must use SSLFactory for the HttpClient connections
[ https://issues.apache.org/jira/browse/YARN-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-113: --- Fix Version/s: (was: 2.4.0) 2.5.0 WebAppProxyServlet must use SSLFactory for the HttpClient connections - Key: YARN-113 URL: https://issues.apache.org/jira/browse/YARN-113 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Fix For: 2.5.0 The HttpClient must be configured to use the SSLFactory when the web UIs are over HTTPS, otherwise the proxy servlet fails to connect to the AM because of unknown (self-signed) certificates. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1723) AMRMClientAsync missing blacklist addition and removal functionality
[ https://issues.apache.org/jira/browse/YARN-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1723: Fix Version/s: (was: 2.4.0) 2.5.0 AMRMClientAsync missing blacklist addition and removal functionality Key: YARN-1723 URL: https://issues.apache.org/jira/browse/YARN-1723 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Bikas Saha Fix For: 2.5.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1477) No Submit time on AM web pages
[ https://issues.apache.org/jira/browse/YARN-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1477: Fix Version/s: (was: 2.4.0) 2.5.0 No Submit time on AM web pages -- Key: YARN-1477 URL: https://issues.apache.org/jira/browse/YARN-1477 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Chen He Assignee: Chen He Labels: features Fix For: 2.5.0 Similar to MAPREDUCE-5052, This is a fix on AM side. Add submitTime field to the AM's web services REST API -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1147) Add end-to-end tests for HA
[ https://issues.apache.org/jira/browse/YARN-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1147: Fix Version/s: (was: 2.4.0) 2.5.0 Add end-to-end tests for HA --- Key: YARN-1147 URL: https://issues.apache.org/jira/browse/YARN-1147 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.5.0 While individual sub-tasks add tests for the code they include, it will be handy to write end-to-end tests for HA including some stress testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1156) Change NodeManager AllocatedGB and AvailableGB metrics to show decimal values
[ https://issues.apache.org/jira/browse/YARN-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1156: Fix Version/s: (was: 2.4.0) 2.5.0 Change NodeManager AllocatedGB and AvailableGB metrics to show decimal values - Key: YARN-1156 URL: https://issues.apache.org/jira/browse/YARN-1156 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.1.0-beta Reporter: Akira AJISAKA Assignee: Tsuyoshi OZAWA Priority: Minor Labels: metrics, newbie Fix For: 2.5.0 Attachments: YARN-1156.1.patch The AllocatedGB and AvailableGB metrics are currently integer-typed. If 500MB of memory is allocated to containers four times, AllocatedGB is incremented four times by {{(int)500/1024}}, which is 0. That is, 2000MB of memory has actually been allocated, but the metric shows 0GB. Let's use float type for these metrics. -- This message was sent by Atlassian JIRA (v6.2#6252)
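A small self-contained illustration of the truncation described above (the variable names mirror the metrics; this is not the NodeManager code itself):
{code:title=AllocatedGbSketch.java|borderStyle=solid}
public class AllocatedGbSketch {
  public static void main(String[] args) {
    int allocatedGbInt = 0;
    float allocatedGbFloat = 0f;
    for (int i = 0; i < 4; i++) {
      int allocatedMb = 500;
      allocatedGbInt += allocatedMb / 1024;      // integer division: 500/1024 == 0
      allocatedGbFloat += allocatedMb / 1024f;   // float division: ~0.488 per allocation
    }
    System.out.println("int metric:   " + allocatedGbInt + " GB");    // prints 0 GB
    System.out.println("float metric: " + allocatedGbFloat + " GB");  // prints ~1.95 GB (2000MB)
  }
}
{code}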
[jira] [Updated] (YARN-965) NodeManager Metrics containersRunning is not correct When localizing container process is failed or killed
[ https://issues.apache.org/jira/browse/YARN-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-965: --- Fix Version/s: (was: 2.4.0) 2.5.0 NodeManager Metrics containersRunning is not correct When localizing container process is failed or killed -- Key: YARN-965 URL: https://issues.apache.org/jira/browse/YARN-965 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.4-alpha Environment: suse linux Reporter: Li Yuan Fix For: 2.5.0 When a container is successfully launched, its state goes from LOCALIZED to RUNNING and containersRunning is incremented. When the state goes from EXITED_WITH_FAILURE or KILLING to DONE, containersRunning is decremented. However, the EXITED_WITH_FAILURE or KILLING state could have been reached from LOCALIZING or LOCALIZED, not RUNNING, which causes containersRunning to be less than the actual number. Furthermore, the metrics then become inconsistent: containersLaunched != containersCompleted + containersFailed + containersKilled + containersRunning + containersIniting -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1142) MiniYARNCluster web ui does not work properly
[ https://issues.apache.org/jira/browse/YARN-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1142: Fix Version/s: (was: 2.4.0) 2.5.0 MiniYARNCluster web ui does not work properly - Key: YARN-1142 URL: https://issues.apache.org/jira/browse/YARN-1142 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Alejandro Abdelnur Fix For: 2.5.0 When going to the RM http port, the NM web ui is displayed. It seems there is a singleton somewhere that breaks things when the RM and NMs run in the same process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-308) Improve documentation about what asks means in AMRMProtocol
[ https://issues.apache.org/jira/browse/YARN-308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-308: --- Fix Version/s: (was: 2.4.0) 2.5.0 Improve documentation about what asks means in AMRMProtocol - Key: YARN-308 URL: https://issues.apache.org/jira/browse/YARN-308 Project: Hadoop YARN Issue Type: Sub-task Components: api, documentation, resourcemanager Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.5.0 Attachments: YARN-308.patch It's unclear to me from reading the javadoc exactly what asks means when the AM sends a heartbeat to the RM. Is the AM supposed to send a list of all resources that it is waiting for? Or just inform the RM about new ones that it wants? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-614: --- Fix Version/s: (was: 2.4.0) 2.5.0 Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Chris Riccomini Fix For: 2.5.0 Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1621) Add CLI to list states of yarn container-IDs/hosts
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1621: Fix Version/s: (was: 2.4.0) 2.5.0 Add CLI to list states of yarn container-IDs/hosts -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Fix For: 2.5.0 As more applications are moved to YARN, we need a generic CLI to list the states of yarn containers and their hosts. Today, if a YARN application running in a container hangs, there is no way to stop it other than manually killing its process. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers <appId> <status>
where status is one of running/succeeded/killed/failed/all
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1327) Fix nodemgr native compilation problems on FreeBSD9
[ https://issues.apache.org/jira/browse/YARN-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1327: Fix Version/s: (was: 2.4.0) 2.5.0 Fix nodemgr native compilation problems on FreeBSD9 --- Key: YARN-1327 URL: https://issues.apache.org/jira/browse/YARN-1327 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Radim Kolar Assignee: Radim Kolar Fix For: 3.0.0, 2.5.0 Attachments: nodemgr-portability.txt There are several portability problems preventing the native component from compiling on FreeBSD.
1. libgen.h is not included. The correct function prototype is there, but Linux glibc has a workaround that defines it for the user if libgen.h is not directly included. Include this file directly.
2. Query the maximum size of the login name using sysconf. This follows the same code style as the rest of the code, which already uses sysconf.
3. cgroups are a Linux-only feature; compile them conditionally and return an error if mount_cgroup is attempted on a non-Linux OS.
4. Do not use the POSIX function setpgrp(), since it clashes with the function of the same name from BSD 4.2; use the equivalent call instead. After inspecting the glibc sources, it is just a shortcut for setpgid(0,0).
These changes make it compile on both Linux and FreeBSD. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-160: --- Fix Version/s: (was: 2.4.0) 2.5.0 nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Fix For: 2.5.0 As mentioned in YARN-2 *NM memory and CPU configs*, these values currently come from the NM's config; we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (the amount of mem/cpu not to be made available as YARN resources); this would allow reserving mem/cpu for the OS and other services running outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
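For the Linux case, a hedged sketch of the kind of probing the issue suggests; the class and method names are illustrative, and a real implementation would sit behind the OS-abstraction interface described above.
{code:title=LinuxResourceProbe.java|borderStyle=solid}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LinuxResourceProbe {
  /** Total physical memory in MB, parsed from the MemTotal line of /proc/meminfo (reported in kB). */
  public static long totalMemoryMb() throws IOException {
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"))) {
      String line;
      while ((line = r.readLine()) != null) {
        if (line.startsWith("MemTotal:")) {
          long kb = Long.parseLong(line.replaceAll("[^0-9]", ""));
          return kb / 1024;
        }
      }
    }
    throw new IOException("MemTotal not found in /proc/meminfo");
  }

  /** Number of logical processors, counted from "processor" entries in /proc/cpuinfo. */
  public static int numProcessors() throws IOException {
    int count = 0;
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/cpuinfo"))) {
      String line;
      while ((line = r.readLine()) != null) {
        if (line.startsWith("processor")) {
          count++;
        }
      }
    }
    return count;
  }
}
{code}
An offset could then be subtracted from these values before registering the node's resources, as the issue proposes.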
[jira] [Updated] (YARN-745) Move UnmanagedAMLauncher to yarn client package
[ https://issues.apache.org/jira/browse/YARN-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-745: --- Fix Version/s: (was: 2.4.0) 2.5.0 Move UnmanagedAMLauncher to yarn client package --- Key: YARN-745 URL: https://issues.apache.org/jira/browse/YARN-745 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Fix For: 2.5.0 It is currently sitting in the yarn applications project, which sounds wrong. The client project sounds better, since it contains the utilities/libraries that clients use to write and debug yarn applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-996) REST API support for node resource configuration
[ https://issues.apache.org/jira/browse/YARN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965331#comment-13965331 ] Thomas Graves commented on YARN-996: Have you tested this with admin acls? Taking a quick look at the code I don't see that the updateNodeResource is being properly protected. I guess that is a separate jira though since its already in there. REST API support for node resource configuration Key: YARN-996 URL: https://issues.apache.org/jira/browse/YARN-996 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, scheduler Reporter: Junping Du Assignee: Kenji Kikushima Attachments: YARN-996-sample.patch Besides admin protocol and CLI, REST API should also be supported for node resource configuration -- This message was sent by Atlassian JIRA (v6.2#6252)
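On the ACL question, a hedged sketch of what guarding a node-resource update with the admin ACL could look like; this is illustrative and not the actual AdminService or REST code.
{code:title=AdminAclGuardSketch.java|borderStyle=solid}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AdminAclGuardSketch {
  /** Reject callers that are not covered by yarn.admin.acl. */
  public static void checkAdminAccess(Configuration conf) throws IOException {
    AccessControlList adminAcl = new AccessControlList(
        conf.get(YarnConfiguration.YARN_ADMIN_ACL,
                 YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    if (!adminAcl.isUserAllowed(caller)) {
      throw new IOException("User " + caller.getShortUserName()
          + " is not authorized to update node resources");
    }
  }
}
{code}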
[jira] [Commented] (YARN-1910) TestAMRMTokens fails on windows
[ https://issues.apache.org/jira/browse/YARN-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965341#comment-13965341 ] Hudson commented on YARN-1910: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1753 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1753/]) YARN-1910. Fixed a race condition in TestAMRMTokens that causes the test to fail more often on Windows. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1586192) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java TestAMRMTokens fails on windows --- Key: YARN-1910 URL: https://issues.apache.org/jira/browse/YARN-1910 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.0 Attachments: YARN-1910.1.patch, YARN-1910.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1477) No Submit time on AM web pages
[ https://issues.apache.org/jira/browse/YARN-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965360#comment-13965360 ] Chen He commented on YARN-1477: --- I am working on it. Thank you for the reminder. No Submit time on AM web pages -- Key: YARN-1477 URL: https://issues.apache.org/jira/browse/YARN-1477 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Chen He Assignee: Chen He Labels: features Fix For: 2.5.0 Similar to MAPREDUCE-5052, this is a fix on the AM side: add a submitTime field to the AM's web services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-1906: Attachment: YARN-1906.patch Attaching patch for trunk and branch-2 TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2 --- Key: YARN-1906 URL: https://issues.apache.org/jira/browse/YARN-1906 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Fix For: 3.0.0, 2.5.0 Attachments: YARN-1906.patch, YARN-1906.patch Here is the output of the format {noformat} testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 9.757 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He reassigned YARN-1857: - Assignee: Chen He CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He It's possible to get an application to hang forever (or for a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full, because the remaining space is being used by ApplicationMasters from other users. For instance, take a cluster with one queue, a user limit of 100%, and multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. Other users try to start applications and get their ApplicationMasters started but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with reduces, but it still needs to finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity): it is using, say, 95% of the cluster for reduces, and the other 5% is being used by other users' running ApplicationMasters. The MRAppMaster thinks it still has 5% headroom, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios as well. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
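A rough, self-contained illustration of the arithmetic in the scenario above (the numbers come from the example; the formula is deliberately simplified and is not the actual CapacityScheduler code):
{code:title=HeadroomSketch.java|borderStyle=solid}
public class HeadroomSketch {
  public static void main(String[] args) {
    int clusterMb     = 100000; // total cluster memory
    int userLimitMb   = 100000; // user limit == 100% of the single queue
    int user1UsedMb   =  95000; // user 1's reduces
    int otherAmUsedMb =   5000; // other users' ApplicationMasters

    // What the scheduler reports today: user limit minus the user's own consumption.
    int reportedHeadroomMb = userLimitMb - user1UsedMb;               // 5000 MB

    // What is actually left on the cluster once the other AMs are counted.
    int actualHeadroomMb = clusterMb - user1UsedMb - otherAmUsedMb;   // 0 MB

    System.out.println("reported headroom = " + reportedHeadroomMb + " MB");
    System.out.println("actual headroom   = " + actualHeadroomMb + " MB");
    // The MRAppMaster sees 5000 MB of headroom, so it never preempts a reduce
    // to run the remaining maps, even though nothing can actually be allocated.
  }
}
{code}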
[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965449#comment-13965449 ] Mit Desai commented on YARN-1906: - *Explanation of the changes made* # Two assert statements were removed from the test that were verifying the pending application count increased from 0 to 1. This is an intermediate result (which is good to test if consistent). In this case, the intermediate results are inconsistent as the application transition to pending state can be or cannot be detected when the assert is called. The aim of the test is to check the queueMetrics value before and after restart. And this is working as expected without the assert for pendingApps. I have tested the patch by running the test 25 times and it passes. # The assertQueueMetrics function was not properly implemented. assertEquals() takes in 2 parameters 1. expected value, 2. Actual value. All the asserts in the assertQueueMetrics were implemented in the opposite way leading to a wrong error message on an assert failure. TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2 --- Key: YARN-1906 URL: https://issues.apache.org/jira/browse/YARN-1906 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Fix For: 3.0.0, 2.5.0 Attachments: YARN-1906.patch, YARN-1906.patch Here is the output of the format {noformat} testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 9.757 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965492#comment-13965492 ] Zhijie Shen commented on YARN-1906: --- 1. I looked into exception again: {code} at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) {code} It seems the test fails at {code} // finish the AMs finishApplicationMaster(loadedApp1, rm2, nm1, am1); assertQueueMetrics(qm2, 1, 0, 0, 1); {code} Race condition here? Should we waitForState here before assertion? 2. One suggestion on assertQueueMetrics: It would be better to add the messages for the for assertion sentences, such that when an exception happens, we can easily see which metric is wrong. TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2 --- Key: YARN-1906 URL: https://issues.apache.org/jira/browse/YARN-1906 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Fix For: 3.0.0, 2.5.0 Attachments: YARN-1906.patch, YARN-1906.patch Here is the output of the format {noformat} testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 9.757 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
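A hedged sketch combining the two suggestions above (wait for the app's final state before asserting, and give each assertion a message naming the metric); the waitForState call assumes the MockRM-style helper used elsewhere in this test, and the getter names and value mapping are only illustrative:
{code:title=sketch of the assertion change|borderStyle=solid}
// finish the AMs, then let the RM observe the final state before checking metrics
finishApplicationMaster(loadedApp1, rm2, nm1, am1);
rm2.waitForState(loadedApp1.getApplicationId(), RMAppState.FINISHED);

assertEquals("appsSubmitted", 1, qm2.getAppsSubmitted());
assertEquals("appsPending",   0, qm2.getAppsPending());
assertEquals("appsRunning",   0, qm2.getAppsRunning());
assertEquals("appsCompleted", 1, qm2.getAppsCompleted());
{code}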
[jira] [Created] (YARN-1921) Allow to override queue prefix, where new queues created
Andrey Stepachev created YARN-1921: -- Summary: Allow to override queue prefix, where new queues created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev The Fair Scheduler has a couple of QueuePlacementRules. Those rules can create queues if they do not exist, with the hardcoded prefix root.. Consider an example: we have a placement rule which creates a user's queue if it does not exist. The current implementation creates it under the root. prefix. Suppose that this user runs a big job. In that case it will get a fair share of resources, because the queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users in the default queue, but in that case, if a user submits a big job, it will eat the resources of the whole queue, and we know that no preemption can be done within one queue (or am I wrong?). So effectively one user can usurp all of the default queue's resources. To solve that I created a patch which allows overriding the root. prefix in QueuePlacementRules. That gives us the flexibility to automatically create queues for users or groups of users under a predefined queue. So, every user will get a separate queue and will share the parent queue's resources, and can't usurp all resources, because the parent node can be configured to preempt tasks. Consider this example (a parent queue is specified for each rule):
{code:title=policy.xml|borderStyle=solid}
<queuePlacementPolicy>
  <rule name='specified' parent='granted'/>
  <rule name='user' parent='guests'/>
</queuePlacementPolicy>
{code}
With such a definition, queue requests will give us:
{code:title=Example.java|borderStyle=solid}
"root.granted.specifiedq" == policy.assignAppToQueue("specifiedq", "someuser");
"root.guests.someuser"    == policy.assignAppToQueue("default", "someuser");
"root.guests.otheruser"   == policy.assignAppToQueue("default", "otheruser");
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965500#comment-13965500 ] Hadoop QA commented on YARN-1906: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639589/YARN-1906.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3542//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3542//console This message is automatically generated. TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2 --- Key: YARN-1906 URL: https://issues.apache.org/jira/browse/YARN-1906 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Fix For: 3.0.0, 2.5.0 Attachments: YARN-1906.patch, YARN-1906.patch Here is the output of the format {noformat} testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 9.757 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1921) Allow to override queue prefix, where new queues created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1921: --- Attachment: YARN-1921.patch Allow to override queue prefix, where new queues created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eats resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1910) TestAMRMTokens fails on windows
[ https://issues.apache.org/jira/browse/YARN-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1910: -- Fix Version/s: (was: 2.4.0) 2.4.1 TestAMRMTokens fails on windows --- Key: YARN-1910 URL: https://issues.apache.org/jira/browse/YARN-1910 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.1 Attachments: YARN-1910.1.patch, YARN-1910.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1906: -- Target Version/s: 2.4.1 Affects Version/s: 2.4.0 Fix Version/s: (was: 2.5.0) (was: 3.0.0) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2 --- Key: YARN-1906 URL: https://issues.apache.org/jira/browse/YARN-1906 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-1906.patch, YARN-1906.patch Here is the output of the format {noformat} testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 9.757 sec FAILURE! java.lang.AssertionError: expected:2 but was:1 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965515#comment-13965515 ] Arun C Murthy commented on YARN-1769: - Sorry guys, been slammed. I'll take a look at this presently. Tx. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of these by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965546#comment-13965546 ] Vinod Kumar Vavilapalli commented on YARN-1920: --- To give more context: the test failure is happening because the test before the failing test wasn't deleting the file successfully, due to the file-handle leak. In all the tests in this test-case we expect a new history-store file to be created for each test, and due to the leak that assumption was violated. The leak was leaving the stream open, so the file couldn't be deleted on Windows, though deletion works fine on Linux/Mac. TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Attachments: YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
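A generic sketch of the fix pattern being described — close the store's underlying stream so the file can actually be deleted on Windows; the field and class names are illustrative, not the actual FileSystemApplicationHistoryStore code.
{code:title=HistoryStoreCloseSketch.java|borderStyle=solid}
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.io.IOUtils;

public class HistoryStoreCloseSketch implements Closeable {
  private OutputStream historyStream; // hypothetical handle kept open by the store

  @Override
  public void close() throws IOException {
    // Without this, the stream stays open and Windows refuses to delete the
    // history file, so the next test no longer starts from a fresh file.
    IOUtils.cleanup(null, historyStream);
    historyStream = null;
  }
}
{code}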
[jira] [Commented] (YARN-1677) Potential bugs in exception handlers
[ https://issues.apache.org/jira/browse/YARN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1396#comment-1396 ] Devaraj K commented on YARN-1677: - Thanks Ding for taking up these, Appreciate your work. Could you split these into multiple Jira's instead of grouping into single Jira. And also can you add the tests for the changes, please refer http://wiki.apache.org/hadoop/HowToContribute#Making_Changes. After attaching the patch for the issue, please click on the 'Submit Patch' button so that Jenkins can run the patch and also any one can review it. Potential bugs in exception handlers Key: YARN-1677 URL: https://issues.apache.org/jira/browse/YARN-1677 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Ding Yuan Attachments: yarn-1677.patch Hi Yarn developers, We are a group of researchers on software reliability, and recently we did a study and found that majority of the most severe failures in hadoop are caused by bugs in exception handling logic. Therefore we built a simple checking tool that automatically detects some bug patterns that have caused some very severe failures. I am reporting some of the results for Yarn here. Any feedback is much appreciated! == Case 1: Line: 551, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java {noformat} switch (monitoringEvent.getType()) { case START_MONITORING_CONTAINER: .. .. default: // TODO: Wrong event. } {noformat} The switch fall-through (handling any potential unexpected event) is empty. Should we at least print an error message here? == == Case 2: Line: 491, File: org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java {noformat} } catch (Throwable e) { // TODO Better error handling. Thread can die with the rest of the // NM still running. LOG.error(Caught exception in status-updater, e); } {noformat} The handler of this very general exception only logs the error. The TODO seems to indicate it is not sufficient. == == Case 3: Line: 861, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java for (LocalResourceStatus stat : remoteResourceStatuses) { LocalResource rsrc = stat.getResource(); LocalResourceRequest req = null; try { req = new LocalResourceRequest(rsrc); } catch (URISyntaxException e) { // TODO fail? Already translated several times... } The handler for URISyntaxException is empty, and the TODO seems to indicate it is not sufficient. The same code pattern can also be found at: Line: 901, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Line: 838, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Line: 878, File: org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java At line: 803, File: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, the handler of URISyntaxException also seems not sufficient: {noformat} try { shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI( shellScriptPath))); } catch (URISyntaxException e) { LOG.error(Error when trying to use shell script path specified + in env, path= + shellScriptPath); e.printStackTrace(); // A failure scenario on bad input such as invalid shell script path // We know we cannot continue launching the container // so we should release it. 
// TODO numCompletedContainers.incrementAndGet(); numFailedContainers.incrementAndGet(); return; } {noformat} == == Case 4: Line: 627, File: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java {noformat} try { /* keep the master in sync with the state machine */ this.stateMachine.doTransition(event.getType(), event); } catch (InvalidStateTransitonException e) { LOG.error(Can't handle this event at current state, e); /* TODO fail the application on the failed transition */ } {noformat} The handler of this exception only logs the error. The TODO seems to indicate it is not sufficient. This exact same code pattern can also be found at:
[jira] [Created] (YARN-1922) Process group remains alive after container process is killed externally
Billie Rinaldi created YARN-1922: Summary: Process group remains alive after container process is killed externally Key: YARN-1922 URL: https://issues.apache.org/jira/browse/YARN-1922 Project: Hadoop YARN Issue Type: Bug Environment: CentOS 6.4 Reporter: Billie Rinaldi Assignee: Billie Rinaldi If the main container process is killed externally, ContainerLaunch does not kill the rest of the process group. Before sending the event that results in the ContainerLaunch.containerCleanup method being called, ContainerLaunch sets the completed flag to true. Then when cleaning up, it doesn't try to read the pid file if the completed flag is true. If it read the pid file, it would proceed to send the container a kill signal. In the case of the DefaultContainerExecutor, this would kill the process group. -- This message was sent by Atlassian JIRA (v6.2#6252)
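A hedged sketch of the cleanup behaviour being argued for: read the pid file even when the completed flag is already set, then signal the whole process group, which is what the DefaultContainerExecutor's session-based launch makes possible. The paths and helper names here are illustrative.
{code:title=ProcessGroupCleanupSketch.java|borderStyle=solid}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ProcessGroupCleanupSketch {
  /** Signal the container's process group based on its pid file, if one was written. */
  public static void cleanup(String pidFilePath) throws IOException, InterruptedException {
    Path pidFile = Paths.get(pidFilePath);
    if (!Files.exists(pidFile)) {
      return; // the container process was never launched
    }
    String pid = new String(Files.readAllBytes(pidFile), StandardCharsets.UTF_8).trim();
    // "kill -- -PID" sends the signal to the whole process group, so children
    // of an externally killed container process are cleaned up as well.
    Process p = new ProcessBuilder("kill", "--", "-" + pid).inheritIO().start();
    p.waitFor();
  }
}
{code}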
[jira] [Updated] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1920: -- Attachment: YARN-1920.2.patch TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Attachments: YARN-1920.2.patch, YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965571#comment-13965571 ] Zhijie Shen commented on YARN-1920: --- The reason of the file sounds right to me. The patch looks good as well. I made a minor change to the patch to make each LOG.error to record the exception instance. Will commit it once Jenkins +1 TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Attachments: YARN-1920.2.patch, YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1922) Process group remains alive after container process is killed externally
[ https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-1922: - Attachment: YARN-1922.1.patch Process group remains alive after container process is killed externally Key: YARN-1922 URL: https://issues.apache.org/jira/browse/YARN-1922 Project: Hadoop YARN Issue Type: Bug Environment: CentOS 6.4 Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1922.1.patch If the main container process is killed externally, ContainerLaunch does not kill the rest of the process group. Before sending the event that results in the ContainerLaunch.containerCleanup method being called, ContainerLaunch sets the completed flag to true. Then when cleaning up, it doesn't try to read the pid file if the completed flag is true. If it read the pid file, it would proceed to send the container a kill signal. In the case of the DefaultContainerExecutor, this would kill the process group. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
[ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965591#comment-13965591 ] Chen He commented on YARN-862: -- Since the rolling-upgrade work has been checked in to Hadoop and YARN-819 is resolved, I will close this one. ResourceManager and NodeManager versions should match on node registration or error out --- Key: YARN-862 URL: https://issues.apache.org/jira/browse/YARN-862 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 0.23.8 Reporter: Robert Parker Assignee: Robert Parker Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch For branch-0.23 the versions of the node manager and the resource manager should match to complete a successful registration. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
[ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965593#comment-13965593 ] Chen He commented on YARN-862: -- Thank you for the patch, [~reparker]. ResourceManager and NodeManager versions should match on node registration or error out --- Key: YARN-862 URL: https://issues.apache.org/jira/browse/YARN-862 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 0.23.8 Reporter: Robert Parker Assignee: Robert Parker Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch For branch-0.23 the versions of the node manager and the resource manager should match to complete a successful registration. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965616#comment-13965616 ] Hadoop QA commented on YARN-1920: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639612/YARN-1920.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3543//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3543//console This message is automatically generated. TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Attachments: YARN-1920.2.patch, YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1914) Test TestFSDownload.testDownloadPublicWithStatCache fails on Windows
[ https://issues.apache.org/jira/browse/YARN-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1914: -- Attachment: apache-yarn-1914.2.patch Same patch as before, but with the code comment matching branch-1. Will check this in when Jenkins says okay. Test TestFSDownload.testDownloadPublicWithStatCache fails on Windows Key: YARN-1914 URL: https://issues.apache.org/jira/browse/YARN-1914 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1914.0.patch, apache-yarn-1914.1.patch, apache-yarn-1914.2.patch The TestFSDownload.testDownloadPublicWithStatCache test in hadoop-yarn-common consistently fails on Windows environments. The root cause is that the test checks for execute permission for all users on every ancestor of the target directory. In Windows, by default, the group Everyone has no permissions on any directory in the install drive. It's unreasonable to expect this test to pass, and we should skip it on Windows. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
Anubhav Dhoot created YARN-1923: --- Summary: Make FairScheduler resource ratio calculations terminate faster Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot In the fair scheduler, computing shares continues until all iterations are complete even when there is a perfect match between the resource shares and the total resources. This is because the binary search checks only less-than or greater-than, not equality. Add an early-termination condition for the equal case. -- This message was sent by Atlassian JIRA (v6.2#6252)
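To illustrate the idea (a sketch only, not the attached patch; the comparison helper and iteration bound are assumptions), the binary search over the resource ratio can return as soon as the computed shares compare equal to the total resources, instead of always running the full iteration budget:
{code:title=EarlyTerminationSketch.java|borderStyle=solid}
import java.util.function.DoubleToIntFunction;

public class EarlyTerminationSketch {
  /**
   * Binary-search for a ratio in [0, 1]. compareSharesToTotal returns a
   * negative value if the shares computed at that ratio are below the total
   * resources, zero on an exact match, and a positive value if above.
   */
  static double findRatio(DoubleToIntFunction compareSharesToTotal, int maxIterations) {
    double left = 0.0;
    double right = 1.0;
    for (int i = 0; i < maxIterations; i++) {
      double mid = (left + right) / 2.0;
      int cmp = compareSharesToTotal.applyAsInt(mid);
      if (cmp == 0) {
        return mid;          // early termination on a perfect match
      } else if (cmp < 0) {
        left = mid;          // shares too small, search higher ratios
      } else {
        right = mid;         // shares too large, search lower ratios
      }
    }
    return (left + right) / 2.0;
  }
}
{code}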
[jira] [Commented] (YARN-1921) Allow to override queue prefix, where new queues created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965695#comment-13965695 ] Sandy Ryza commented on YARN-1921: -- Hi [~octo47], thanks for the patch, this is a feature I think we definitely need. Similar work has already been proposed on YARN-1864. Mind chiming in on the discussion over there? Allow to override queue prefix, where new queues created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch The Fair Scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they do not exist, with the hardcoded prefix 'root.'. Consider an example: we have a placement rule which creates a user's queue if it does not exist. The current implementation creates it under the 'root.' prefix. Suppose that this user runs a big job. In that case it will get a fair share of resources, because the queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users into the default queue, but in that case if a user submits a big job it will eat the resources of the whole queue, and we know that no preemption can be done within one queue (or am I wrong?). So effectively one user can usurp all of the default queue's resources. To solve that I created a patch which allows overriding the 'root.' prefix in QueuePlacementRules. That gives us the flexibility to automatically create queues for users or groups of users under a predefined queue. So every user will get a separate queue and will share the parent queue's resources, and can't usurp all resources, because the parent queue can be configured to preempt tasks. Consider an example (parent queue specified for each rule):
{code:title=policy.xml|borderStyle=solid}
<queuePlacementPolicy>
  <rule name='specified' parent='granted'/>
  <rule name='user' parent='guests'/>
</queuePlacementPolicy>
{code}
With such a definition, queue assignment will give us:
{code:title=Example.java|borderStyle=solid}
"root.granted.specifiedq" == policy.assignAppToQueue("specifiedq", "someuser");
"root.guests.someuser"    == policy.assignAppToQueue("default", "someuser");
"root.guests.otheruser"   == policy.assignAppToQueue("default", "otheruser");
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1914) Test TestFSDownload.testDownloadPublicWithStatCache fails on Windows
[ https://issues.apache.org/jira/browse/YARN-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965697#comment-13965697 ] Hadoop QA commented on YARN-1914: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639618/apache-yarn-1914.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3544//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3544//console This message is automatically generated. Test TestFSDownload.testDownloadPublicWithStatCache fails on Windows Key: YARN-1914 URL: https://issues.apache.org/jira/browse/YARN-1914 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1914.0.patch, apache-yarn-1914.1.patch, apache-yarn-1914.2.patch The TestFSDownload.testDownloadPublicWithStatCache test in hadoop-yarn-common consistently fails on Windows environments. The root cause is that the test checks for execute permission for all users on every ancestor of the target directory. In windows, by default, group Everyone has no permissions on any directory in the install drive. It's unreasonable to expect this test to pass and we should skip it on Windows. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965700#comment-13965700 ] Zhijie Shen commented on YARN-1920: --- Committed to trunk, branch-2 and branch-2.4. Thanks, Vinod! TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Labels: test, windows Fix For: 2.4.1 Attachments: YARN-1920.2.patch, YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1920: -- Labels: test windows (was: ) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Labels: test, windows Fix For: 2.4.1 Attachments: YARN-1920.2.patch, YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues
[ https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965701#comment-13965701 ] Andrey Stepachev commented on YARN-1864: I have a different solution. My patch at https://issues.apache.org/jira/browse/YARN-1921 is smaller and simpler and allows overriding the queue prefix. With it, it is possible to place users in different queues for different groups. Fair Scheduler Dynamic Hierarchical User Queues --- Key: YARN-1864 URL: https://issues.apache.org/jira/browse/YARN-1864 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Ashwin Shankar Labels: scheduler Attachments: YARN-1864-v1.txt In Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For example, say user1 submits a job to a parent queue called root.allUserQueues; we want to be able to create a new queue called root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted by this user to root.allUserQueues will be run in this newly created root.allUserQueues.user1. This is very similar to the 'user-as-default' feature in Fair Scheduler, which creates user queues under the root queue, but we want the ability to create user queues under ANY parent queue. Why do we want this? 1. Preemption: these dynamically created user queues can preempt each other if their fair share is not met, so there is fairness among users. User queues can also preempt other non-user leaf queues if they are below their fair share. 2. Allocation to user queues: we want all the (ad hoc) user queries to consume only a fraction of resources in the shared cluster. With this feature, we could do that by giving a fair share to the parent user queue, which is then redistributed to all the dynamically created user queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1910) TestAMRMTokens fails on windows
[ https://issues.apache.org/jira/browse/YARN-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965713#comment-13965713 ] Hudson commented on YARN-1910: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1728 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1728/]) YARN-1910. Fixed a race condition in TestAMRMTokens that causes the test to fail more often on Windows. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1586192) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java TestAMRMTokens fails on windows --- Key: YARN-1910 URL: https://issues.apache.org/jira/browse/YARN-1910 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.1 Attachments: YARN-1910.1.patch, YARN-1910.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1907) TestRMApplicationHistoryWriter#testRMWritingMassiveHistory runs slow and intermittently fails
[ https://issues.apache.org/jira/browse/YARN-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965717#comment-13965717 ] Hudson commented on YARN-1907: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1728 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1728/]) YARN-1907. TestRMApplicationHistoryWriter#testRMWritingMassiveHistory intermittently fails. Contributed by Mit Desai. (kihwal: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1585992) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java TestRMApplicationHistoryWriter#testRMWritingMassiveHistory runs slow and intermittently fails - Key: YARN-1907 URL: https://issues.apache.org/jira/browse/YARN-1907 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 3.0.0, 2.5.0 Attachments: HDFS-6195.patch The test has 1 containers that it tries to cleanup. The cleanup has a timeout of 2ms in which the test sometimes cannot do the cleanup completely and gives out an Assertion Failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
[ https://issues.apache.org/jira/browse/YARN-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1923: Attachment: YARN-1923.patch Make FairScheduler resource ratio calculations terminate faster --- Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-1923.patch In fair scheduler computing shares continues till iterations are complete even when we have a perfect match between the resource shares and total resources. This is because the binary search checks only less or greater and not equals. Add an early termination condition when its equal -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1920) TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows
[ https://issues.apache.org/jira/browse/YARN-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965731#comment-13965731 ] Hudson commented on YARN-1920: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5491 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5491/]) YARN-1920. Fixed TestFileSystemApplicationHistoryStore failure on windows. Contributed by Vinod Kumar Vavilapalli. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1586414) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java TestFileSystemApplicationHistoryStore.testMissingApplicationAttemptHistoryData fails in windows --- Key: YARN-1920 URL: https://issues.apache.org/jira/browse/YARN-1920 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Labels: test, windows Fix For: 2.4.1 Attachments: YARN-1920.2.patch, YARN-1920.txt Though this was only failing in Windows, after debugging, I realized that the test fails because we are leaking a file-handle in the history service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1921) Allow to override queue prefix, where new queues created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965732#comment-13965732 ] Andrey Stepachev commented on YARN-1921: Thanks for pointing on similar solution, but solution with overriding prefix looks a bit cleaner and local (i.e. we can comprehend what is going on right from the policy definition). And not intrusive at all (interfaces and methods signatures are the same) Allow to override queue prefix, where new queues created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eats resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1921) Allow to override queue prefix, where new queues created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1921: --- Description: Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eat resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} was: Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eats resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. 
Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} Allow to override queue prefix, where new queues created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eat resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to
[jira] [Updated] (YARN-1921) Allow to override queue prefix, where new queues will be created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1921: --- Summary: Allow to override queue prefix, where new queues will be created (was: Allow to override queue prefix, where new queues created) Allow to override queue prefix, where new queues will be created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eat resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1853) Allow containers to be ran under real user even in insecure mode
[ https://issues.apache.org/jira/browse/YARN-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965739#comment-13965739 ] Hadoop QA commented on YARN-1853: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12637429/YARN-1853.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3545//console This message is automatically generated. Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.2.0 Reporter: Andrey Stepachev Attachments: YARN-1853.patch, YARN-1853.patch Currently an insecure cluster runs all containers under one user (typically nobody). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled. YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965753#comment-13965753 ] Chen He commented on YARN-126: -- The patch is out of date. Hi, SAISSY, can you update your patch? yarn rmadmin help message contains reference to hadoop cli and JT - Key: YARN-126 URL: https://issues.apache.org/jira/browse/YARN-126 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Assignee: Rémy SAISSY Labels: usability Attachments: YARN-126.patch has option to specify a job tracker and the last line for general command line syntax had bin/hadoop command [genericOptions] [commandOptions] ran yarn rmadmin to get usage: RMAdmin Usage: java RMAdmin [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-refreshAdminAcls] [-refreshServiceAcl] [-help [cmd]] Generic options supported are -conf configuration file specify an application configuration file -D property=valueuse value for given property -fs local|namenode:port specify a namenode -jt local|jobtracker:portspecify a job tracker -files comma separated list of filesspecify comma separated files to be copied to the map reduce cluster -libjars comma separated list of jarsspecify comma separated jar files to include in the classpath. -archives comma separated list of archivesspecify comma separated archives to be unarchived on the compute machines. The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1914) Test TestFSDownload.testDownloadPublicWithStatCache fails on Windows
[ https://issues.apache.org/jira/browse/YARN-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965763#comment-13965763 ] Hudson commented on YARN-1914: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5492 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5492/]) YARN-1914. Fixed resource-download on NodeManagers to skip permission verification of public cache files in Windows+local file-system environment. Contribued by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1586434) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java Test TestFSDownload.testDownloadPublicWithStatCache fails on Windows Key: YARN-1914 URL: https://issues.apache.org/jira/browse/YARN-1914 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.4.1 Attachments: apache-yarn-1914.0.patch, apache-yarn-1914.1.patch, apache-yarn-1914.2.patch The TestFSDownload.testDownloadPublicWithStatCache test in hadoop-yarn-common consistently fails on Windows environments. The root cause is that the test checks for execute permission for all users on every ancestor of the target directory. In windows, by default, group Everyone has no permissions on any directory in the install drive. It's unreasonable to expect this test to pass and we should skip it on Windows. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1106) The RM should point the tracking url to the RM app page if its empty
[ https://issues.apache.org/jira/browse/YARN-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965777#comment-13965777 ] Chen He commented on YARN-1106: --- Thank you for the patch [~tgraves]. I apply the patch to trunk and get same test failure that [~jeagles] mentioned. Here is the failure message. Tests run: 62, Failures: 2, Errors: 0, Skipped: 1, Time elapsed: 5.285 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions testNoTrackingUrlSetRMAppPageComplete[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions) Time elapsed: 0.052 sec FAILURE! org.junit.ComparisonFailure: expected:[proxy:8088/cluster/app/application_1397159363386_0031] but was:[N/A] at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions.testNoTrackingUrlSetRMAppPageComplete(TestRMAppAttemptTransitions.java:1252) testNoTrackingUrlSetRMAppPageComplete[1](org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions) Time elapsed: 0.043 sec FAILURE! org.junit.ComparisonFailure: expected:[proxy:8088/cluster/app/application_1397159363386_0062] but was:[N/A] at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions.testNoTrackingUrlSetRMAppPageComplete(TestRMAppAttemptTransitions.java:1252) The RM should point the tracking url to the RM app page if its empty Key: YARN-1106 URL: https://issues.apache.org/jira/browse/YARN-1106 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.1.0-beta, 0.23.9 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1106.patch, YARN-1106.patch It would be nice if the Resourcemanager set the tracking url to the RM app page if the application master doesn't pass one or passes the empty string. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1921) Allow to override queue prefix, where new queues will be created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965787#comment-13965787 ] Hadoop QA commented on YARN-1921: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639595/YARN-1921.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3546//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3546//console This message is automatically generated. Allow to override queue prefix, where new queues will be created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eat resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
[ https://issues.apache.org/jira/browse/YARN-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965788#comment-13965788 ] Hadoop QA commented on YARN-1923: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639632/YARN-1923.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3547//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3547//console This message is automatically generated. Make FairScheduler resource ratio calculations terminate faster --- Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-1923.patch In fair scheduler computing shares continues till iterations are complete even when we have a perfect match between the resource shares and total resources. This is because the binary search checks only less or greater and not equals. Add an early termination condition when its equal -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues
[ https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965792#comment-13965792 ] Sandy Ryza commented on YARN-1864: -- [~octo47], a design goal is to be able to have multiple queues that have user-queues underneath. E.g. an administrator might want to be able to configure marketing and finance queues, and have queues based off of the submitter's username within each of those queues. If I understand correctly, your solution wouldn't accommodate this. Am I missing anything? Fair Scheduler Dynamic Hierarchical User Queues --- Key: YARN-1864 URL: https://issues.apache.org/jira/browse/YARN-1864 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Ashwin Shankar Labels: scheduler Attachments: YARN-1864-v1.txt In Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For eg. Say user1 submits a job to a parent queue called root.allUserQueues, we want be able to create a new queue called root.allUserQueues.user1 and run user1's job in it.Any further jobs submitted by this user to root.allUserQueues will be run in this newly created root.allUserQueues.user1. This is very similar to the 'user-as-default' feature in Fair Scheduler which creates user queues under root queue. But we want the ability to create user queues under ANY parent queue. Why do we want this ? 1. Preemption : these dynamically created user queues can preempt each other if its fair share is not met. So there is fairness among users. User queues can also preempt other non-user leaf queue as well if below fair share. 2. Allocation to user queues : we want all the user queries(adhoc) to consume only a fraction of resources in the shared cluster. By creating this feature,we could do that by giving a fair share to the parent user queue which is then redistributed to all the dynamically created user queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1582) Capacity Scheduler: add a maximum-allocation-mb setting per queue
[ https://issues.apache.org/jira/browse/YARN-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965799#comment-13965799 ] Chen He commented on YARN-1582: --- +1, patch looks good to me. Capacity Scheduler: add a maximum-allocation-mb setting per queue -- Key: YARN-1582 URL: https://issues.apache.org/jira/browse/YARN-1582 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 3.0.0, 0.23.10, 2.2.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1582-branch-0.23.patch We want to allow certain queues to use larger container sizes while limiting other queues to smaller container sizes. Setting it per queue will help prevent abuse, help limit the impact of reservations, and allow changes in the maximum container size to be rolled out more easily. One reason this is needed is more application types are becoming available on yarn and certain applications require more memory to run efficiently. While we want to allow for that we don't want other applications to abuse that and start requesting bigger containers then what they really need. Note that we could have this based on application type, but that might not be totally accurate either since for example you might want to allow certain users on MapReduce to use larger containers, while limiting other users of MapReduce to smaller containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
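For context, the cluster-wide ceiling today is yarn.scheduler.maximum-allocation-mb in yarn-site.xml; a per-queue override would presumably sit next to the other per-queue settings in capacity-scheduler.xml, roughly like the sketch below (the property name and queue path are assumptions for illustration, not necessarily what the attached patch uses):
{code:title=capacity-scheduler.xml (sketch)|borderStyle=solid}
<!-- Hypothetical per-queue override: cap containers submitted to root.adhoc
     at 2 GB while other queues keep the cluster-wide maximum. -->
<property>
  <name>yarn.scheduler.capacity.root.adhoc.maximum-allocation-mb</name>
  <value>2048</value>
</property>
{code}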
[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues
[ https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965819#comment-13965819 ] Andrey Stepachev commented on YARN-1864: BTW, the PrimaryGroup and SecondaryGroup rules can be modified with a useUserName attribute; if that attribute is true, the rules will create queues based on submitter names, not group names. That is backward compatible too, so such a change will not hurt. Fair Scheduler Dynamic Hierarchical User Queues --- Key: YARN-1864 URL: https://issues.apache.org/jira/browse/YARN-1864 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Ashwin Shankar Labels: scheduler Attachments: YARN-1864-v1.txt In Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For example, say user1 submits a job to a parent queue called root.allUserQueues; we want to be able to create a new queue called root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted by this user to root.allUserQueues will be run in this newly created root.allUserQueues.user1. This is very similar to the 'user-as-default' feature in Fair Scheduler, which creates user queues under the root queue, but we want the ability to create user queues under ANY parent queue. Why do we want this? 1. Preemption: these dynamically created user queues can preempt each other if their fair share is not met, so there is fairness among users. User queues can also preempt other non-user leaf queues if they are below their fair share. 2. Allocation to user queues: we want all the (ad hoc) user queries to consume only a fraction of resources in the shared cluster. With this feature, we could do that by giving a fair share to the parent user queue, which is then redistributed to all the dynamically created user queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
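A minimal sketch of what such a rule definition might look like (both the useUserName and parent attributes are only the proposals discussed in these JIRAs, not existing Fair Scheduler options):
{code:title=fair-scheduler.xml (sketch)|borderStyle=solid}
<queuePlacementPolicy>
  <!-- Hypothetical: place the app under a fixed parent queue, but name the
       dynamically created leaf queue after the submitting user rather than
       the primary group. -->
  <rule name="primaryGroup" useUserName="true" parent="root.groups"/>
  <rule name="default"/>
</queuePlacementPolicy>
{code}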
[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965833#comment-13965833 ] Varun Vasudev commented on YARN-1903: - +1, patch looks good. Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set --- Key: YARN-1903 URL: https://issues.apache.org/jira/browse/YARN-1903 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1903.1.patch The container status after stopping container is not expected. {code} java.lang.AssertionError: 4: at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1921) Allow to override queue prefix, where new queues will be created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1921: --- Attachment: YARN-1921.patch Allow to override queue prefix, where new queues will be created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch, YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eat resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues
[ https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965855#comment-13965855 ] Andrey Stepachev commented on YARN-1864: I've updated the patch with three additional rules, 'userMatch', 'primaryGroupMatch', and 'secondaryGroupMatch'. Now a test case like this works:
{code}
@Test
public void testSpecifiedUserPolicyWithPrefix() throws Exception {
  StringBuffer sb = new StringBuffer();
  sb.append("<queuePlacementPolicy>");
  sb.append("  <rule name='specified' parent='granted'/>");
  sb.append("  <rule name='userMatch' parent='admin' pattern='admin1|admin4'/>");
  sb.append("  <rule name='primaryGroupMatch' parent='admin.primg' pattern='admin2group.*'/>");
  sb.append("  <rule name='secondaryGroupMatch' parent='admin.secg' pattern='admin3subgroup1'/>");
  sb.append("  <rule name='user' parent='guests'/>");
  sb.append("</queuePlacementPolicy>");
  QueuePlacementPolicy policy = parse(sb.toString());
  assertEquals("root.granted.specifiedq", policy.assignAppToQueue("specifiedq", "someuser"));
  assertEquals("root.admin.admin1", policy.assignAppToQueue("default", "admin1"));
  assertEquals("root.admin.primg.admin2", policy.assignAppToQueue("default", "admin2"));
  assertEquals("root.admin.secg.admin3", policy.assignAppToQueue("default", "admin3"));
  assertEquals("root.guests.someuser", policy.assignAppToQueue("default", "someuser"));
  assertEquals("root.guests.otheruser", policy.assignAppToQueue("default", "otheruser"));
}
{code}
Fair Scheduler Dynamic Hierarchical User Queues --- Key: YARN-1864 URL: https://issues.apache.org/jira/browse/YARN-1864 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Ashwin Shankar Labels: scheduler Attachments: YARN-1864-v1.txt In Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For example, say user1 submits a job to a parent queue called root.allUserQueues; we want to be able to create a new queue called root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted by this user to root.allUserQueues will be run in this newly created root.allUserQueues.user1. This is very similar to the 'user-as-default' feature in Fair Scheduler, which creates user queues under the root queue, but we want the ability to create user queues under ANY parent queue. Why do we want this? 1. Preemption: these dynamically created user queues can preempt each other if their fair share is not met, so there is fairness among users. User queues can also preempt other non-user leaf queues if they are below their fair share. 2. Allocation to user queues: we want all the (ad hoc) user queries to consume only a fraction of resources in the shared cluster. With this feature, we could do that by giving a fair share to the parent user queue, which is then redistributed to all the dynamically created user queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1921) Allow to override queue prefix, where new queues will be created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1921: --- Attachment: (was: YARN-1921.patch) Allow to override queue prefix, where new queues will be created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch, YARN-1921.patch Fair scheduler has a couple of QueuePlacementRules. Those rules can create queues, if they not exists with hardcoded prefix root.. Consider an example: we have a placement rule, which creates user's queue if it not exists. Current implementation creates it at root. prefix Suppose that this user runs a big job. In that case it will get a fair share of resources because queue will be created at 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users to default queue, but in that case if user submits a big queue it will eat resources of whole queue, and we know that no preemption can be done within one queue (Or i'm wrong?). So effectively one user can usurp all default queue resources. To solve that I created a patch, which allows to override root. prefix in QueuePlacementRules. Thats gives us flexibility to automatically create queues for users or group of users under predefined queue. So, every user will get a separate queue and will share parent queue resources and can't usurp all resources, because parent node can be configured to preempt tasks. Consider example (parent queue specified for each rule): {code:title=policy.xml|borderStyle=solid} queuePlacementPolicy rule name='specified' parent='granted'/ rule name='user' parent='guests'/ /queuePlacementPolicy {code} With such definition queue requirements will give us: {code:title=Example.java|borderStyle=solid} root.granted.specifiedq == policy.assignAppToQueue(specifiedq, someuser); root.guests.someuser == policy.assignAppToQueue(default, someuser); root.guests.otheruser == policy.assignAppToQueue(default, otheruser); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1921) Allow to override queue prefix, where new queues will be created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1921: --- Attachment: YARN-1921.patch Allow to override queue prefix, where new queues will be created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch, YARN-1921.patch The Fair Scheduler has a couple of QueuePlacementRules. Those rules can create queues if they do not exist, with the hardcoded prefix 'root.'. Consider an example: we have a placement rule which creates a user's queue if it does not exist. The current implementation creates it under the 'root.' prefix. Suppose that this user runs a big job. In that case it will get a fair share of resources, because the queue is created under 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users into the default queue, but in that case if a user submits a big job it will eat the resources of the whole queue, and we know that no preemption can be done within one queue (or am I wrong?). So effectively one user can usurp all of the default queue's resources. To solve that I created a patch which allows overriding the 'root.' prefix in QueuePlacementRules. That gives us the flexibility to automatically create queues for users or groups of users under a predefined queue. So every user will get a separate queue, will share the parent queue's resources, and can't usurp all resources, because the parent queue can be configured to preempt tasks. Consider an example (parent queue specified for each rule):
{code:title=policy.xml|borderStyle=solid}
<queuePlacementPolicy>
  <rule name='specified' parent='granted'/>
  <rule name='user' parent='guests'/>
</queuePlacementPolicy>
{code}
With such a definition, queue assignment gives us:
{code:title=Example.java|borderStyle=solid}
"root.granted.specifiedq" == policy.assignAppToQueue("specifiedq", "someuser");
"root.guests.someuser" == policy.assignAppToQueue("default", "someuser");
"root.guests.otheruser" == policy.assignAppToQueue("default", "otheruser");
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965877#comment-13965877 ] Hadoop QA commented on YARN-126: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12580129/YARN-126.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3549//console This message is automatically generated. yarn rmadmin help message contains reference to hadoop cli and JT - Key: YARN-126 URL: https://issues.apache.org/jira/browse/YARN-126 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Assignee: Rémy SAISSY Labels: usability Attachments: YARN-126.patch has option to specify a job tracker and the last line for general command line syntax had bin/hadoop command [genericOptions] [commandOptions] ran yarn rmadmin to get usage:
RMAdmin Usage: java RMAdmin [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-refreshAdminAcls] [-refreshServiceAcl] [-help [cmd]]
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is bin/hadoop command [genericOptions] [commandOptions]
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965886#comment-13965886 ] Arpit Gupta commented on YARN-1924: --- Here is the stack trace. {code} cheduler from user hrt_qa in queue default 2014-04-10 09:19:35,907 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(659)) - appattempt_1397121188061_0004_02 State change from SUBMITTED to SCHEDULED 2014-04-10 09:19:36,095 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(639)) - application_1397121188061_0004 State change from ACCEPTED to KILLING 2014-04-10 09:19:36,096 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(986)) - Updating application attempt appattempt_1397121188061_0004_02 with final state: KILLED 2014-04-10 09:19:36,096 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(659)) - appattempt_1397121188061_0004_02 State change from SCHEDULED to FINAL_SAVING 2014-04-10 09:19:36,103 ERROR recovery.RMStateStore (RMStateStore.java:handleStoreEvent(681)) - Error storing appAttempt: appattempt_1397121188061_0004_02 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:834) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:831) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:930) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:949) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:831) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:845) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:862) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:604) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-04-10 09:19:36,107 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(657)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:834) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:831) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:930) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:949) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:831) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:845) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:862) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:604) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at
[jira] [Created] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
Arpit Gupta created YARN-1924: - Summary: RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Noticed on a HA cluster Both RM shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1925) TestSpeculativeExecutionWithMRApp fails
[ https://issues.apache.org/jira/browse/YARN-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1925: -- Labels: test (was: ) TestSpeculativeExecutionWithMRApp fails --- Key: YARN-1925 URL: https://issues.apache.org/jira/browse/YARN-1925 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: test {code} junit.framework.AssertionFailedError: Couldn't speculate successfully at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.assertTrue(Assert.java:20) at org.apache.hadoop.mapreduce.v2.TestSpeculativeExecutionWithMRApp.testSpeculateSuccessfulWithoutUpdateEvents(TestSpeculativeExecutionWithMRApp.java:122 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1925) TestSpeculativeExecutionWithMRApp fails
Zhijie Shen created YARN-1925: - Summary: TestSpeculativeExecutionWithMRApp fails Key: YARN-1925 URL: https://issues.apache.org/jira/browse/YARN-1925 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen {code} junit.framework.AssertionFailedError: Couldn't speculate successfully at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.assertTrue(Assert.java:20) at org.apache.hadoop.mapreduce.v2.TestSpeculativeExecutionWithMRApp.testSpeculateSuccessfulWithoutUpdateEvents(TestSpeculativeExecutionWithMRApp.java:122 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965897#comment-13965897 ] Jian He commented on YARN-1924: --- Thanks Arpit for reporting this issue. The problem is that if we kill the application while it is in the SUBMITTED state, the app will try to save its final state before the initial state has been saved, which causes the no-node-exists exception. I changed the ZK updateState API to check whether the node exists: if it exists, do a set operation; otherwise, do a create operation. RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch Noticed on an HA cluster. Both RMs shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
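As a minimal standalone sketch of the check-then-set-or-create idea described in the comment above, written against the plain ZooKeeper client API rather than the actual ZKRMStateStore code; the path and data here are placeholders, and the real patch also has to handle retries and fencing.
{code:title=UpdateOrCreateZNode.java|borderStyle=solid}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch of "set if the node exists, otherwise create it", using the plain ZooKeeper API.
public class UpdateOrCreateZNode {
  /** Writes data to the znode, creating it first if it does not exist yet. */
  static void updateOrCreate(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    Stat stat = zk.exists(path, false);
    if (stat != null) {
      // Node already exists: overwrite its data (version -1 matches any version).
      zk.setData(path, data, -1);
    } else {
      // Node missing (e.g. final state saved before the initial state): create it instead.
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
  }
}
{code}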
[jira] [Updated] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1924: -- Attachment: YARN-1924.1.patch RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch Noticed on a HA cluster Both RM shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1921) Allow to override queue prefix, where new queues will be created
[ https://issues.apache.org/jira/browse/YARN-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965913#comment-13965913 ] Hadoop QA commented on YARN-1921: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639659/YARN-1921.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3548//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3548//console This message is automatically generated. Allow to override queue prefix, where new queues will be created Key: YARN-1921 URL: https://issues.apache.org/jira/browse/YARN-1921 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.3.0 Environment: Yarn 2.3.0 Reporter: Andrey Stepachev Attachments: YARN-1921.patch, YARN-1921.patch The Fair Scheduler has a couple of QueuePlacementRules. Those rules can create queues if they do not exist, with the hardcoded prefix 'root.'. Consider an example: we have a placement rule which creates a user's queue if it does not exist. The current implementation creates it under the 'root.' prefix. Suppose that this user runs a big job. In that case it will get a fair share of resources, because the queue is created under 'root.' with default settings, and that affects all other users of the cluster. Of course, FairScheduler can place such users into the default queue, but in that case if a user submits a big job it will eat the resources of the whole queue, and we know that no preemption can be done within one queue (or am I wrong?). So effectively one user can usurp all of the default queue's resources. To solve that I created a patch which allows overriding the 'root.' prefix in QueuePlacementRules. That gives us the flexibility to automatically create queues for users or groups of users under a predefined queue. So every user will get a separate queue, will share the parent queue's resources, and can't usurp all resources, because the parent queue can be configured to preempt tasks. Consider an example (parent queue specified for each rule):
{code:title=policy.xml|borderStyle=solid}
<queuePlacementPolicy>
  <rule name='specified' parent='granted'/>
  <rule name='user' parent='guests'/>
</queuePlacementPolicy>
{code}
With such a definition, queue assignment gives us:
{code:title=Example.java|borderStyle=solid}
"root.granted.specifiedq" == policy.assignAppToQueue("specifiedq", "someuser");
"root.guests.someuser" == policy.assignAppToQueue("default", "someuser");
"root.guests.otheruser" == policy.assignAppToQueue("default", "otheruser");
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
[ https://issues.apache.org/jira/browse/YARN-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965919#comment-13965919 ] Sandy Ryza commented on YARN-1923: -- A couple of nits; otherwise, LGTM. There's a spurious whitespace change. Also, in
{code}
+    }
+    else if (resourceRatio
{code}
the else should be on the same line as the closing curly brace. resourceRatio isn't really an accurate name for the variable; resourceUsed might make more sense. Make FairScheduler resource ratio calculations terminate faster --- Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-1923.patch In the Fair Scheduler, computing shares continues until all iterations are complete even when we have a perfect match between the resource shares and the total resources. This is because the binary search checks only less-than or greater-than, and not equality. Add an early termination condition for the equal case. -- This message was sent by Atlassian JIRA (v6.2#6252)
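For readers unfamiliar with the computation being discussed, here is a simplified, self-contained sketch of a binary search with the early-exit-on-equality idea; the resourceUsedAtRatio function, iteration count, and toy numbers are stand-ins and do not reflect the real ComputeFairShares code.
{code:title=FairShareBinarySearchSketch.java|borderStyle=solid}
import java.util.function.DoubleUnaryOperator;

// Simplified sketch of the early-termination idea: stop the binary search as soon as
// the resources consumed at the candidate ratio exactly match the total resources,
// instead of always running the fixed number of iterations.
public class FairShareBinarySearchSketch {
  static final int ITERATIONS = 25;

  static double findRatio(double totalResources, DoubleUnaryOperator resourceUsedAtRatio) {
    double left = 0.0;
    double right = 1.0;
    // Grow the right bound until it covers the total resources.
    while (resourceUsedAtRatio.applyAsDouble(right) < totalResources) {
      right *= 2.0;
    }
    for (int i = 0; i < ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      double used = resourceUsedAtRatio.applyAsDouble(mid);
      if (used < totalResources) {
        left = mid;
      } else if (used > totalResources) {
        right = mid;
      } else {
        return mid; // exact match: terminate early rather than finishing all iterations
      }
    }
    return (left + right) / 2.0;
  }

  public static void main(String[] args) {
    // Toy example: usage grows linearly with the ratio, total = 8 units.
    double ratio = findRatio(8.0, r -> 16.0 * r);
    System.out.println(ratio); // 0.5, found on the first probe
  }
}
{code}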
[jira] [Commented] (YARN-1304) Error starting AM: org.apache.hadoop.security.token.TokenIdentifier: Error reading configuration file
[ https://issues.apache.org/jira/browse/YARN-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965966#comment-13965966 ] Josh Elser commented on YARN-1304: -- Just ran into this one myself. In my case, I had inadvertently updated some hadoop jars after the NM was already running. When an MR job went to look at the classpath, some of the jars weren't on the local filesystem anymore, and it bailed out. Restarting the NM and verifying that all hadoop jars were present and readable was sufficient for me to work around this. Probably worthwhile to close this out without any more info. Error starting AM: org.apache.hadoop.security.token.TokenIdentifier: Error reading configuration file - Key: YARN-1304 URL: https://issues.apache.org/jira/browse/YARN-1304 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Bikas Saha -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965974#comment-13965974 ] Karthik Kambatla commented on YARN-1924: Thanks Jian. I, myself, ran into this once before when the HA work wasn't as stable. Let me take a look. RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch Noticed on a HA cluster Both RM shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Target Version/s: 3.0.0 (was: 3.0.0, 0.23.10) Change Fair Scheduler to not disable delay scheduling by default Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
[ https://issues.apache.org/jira/browse/YARN-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965988#comment-13965988 ] Anubhav Dhoot commented on YARN-1923: - Would plannedResourceUsed or estimatedResourceUsed be better? Make FairScheduler resource ratio calculations terminate faster --- Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-1923.patch In fair scheduler computing shares continues till iterations are complete even when we have a perfect match between the resource shares and total resources. This is because the binary search checks only less or greater and not equals. Add an early termination condition when its equal -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965992#comment-13965992 ] Jian He commented on YARN-1903: --- Looks good. One more suggestion: since the process is never started, we can add one more diagnostic message to clarify that the container process was never started. Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set --- Key: YARN-1903 URL: https://issues.apache.org/jira/browse/YARN-1903 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1903.1.patch The container status after stopping the container is not as expected. {code} java.lang.AssertionError: 4: at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966011#comment-13966011 ] Hadoop QA commented on YARN-1924: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639665/YARN-1924.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3550//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3550//console This message is automatically generated. RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch Noticed on a HA cluster Both RM shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966013#comment-13966013 ] Jian He commented on YARN-1879: --- sorry for the late response, was caught up with other things, will take a look in the next couple of days. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues
[ https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966024#comment-13966024 ] Ashwin Shankar commented on YARN-1864: -- Hi [~octo47], We want to be able to support user queues for any rule (now and in the future) without needing to add code to that rule. I had a discussion with Sandy last week and we felt that implementing hierarchicalUserQueues using nested rules would make it a) extensible and b) clean. For example, if we want user queues for the primary group, all we have to do is: <rule name='hierarchicalUserQueue'> <rule name='primaryGroup'/> </rule> The nested rule would be applied first, and based on what it returns we can decide at the HUQ level whether to put the app in a user queue underneath or skip to the next rule. This way we can also have a 'create' flag at both the hierarchicalUserQueue level and the nested rule level, which gives the admin granular control over creating new queues. If someone writes any other rule in the future and wants user queue support, they just have to nest it within the HUQ rule and things will just work. No extra attributes are needed in the XML for the new rule. As part of this patch, another thing I'm writing is the ability to mention parent queues without leaf queues in the alloc xml, which can then be used as user queues. I'm done with the code; I will write tests and post a patch with all these features this week. Fair Scheduler Dynamic Hierarchical User Queues --- Key: YARN-1864 URL: https://issues.apache.org/jira/browse/YARN-1864 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Ashwin Shankar Labels: scheduler Attachments: YARN-1864-v1.txt In Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For example, say user1 submits a job to a parent queue called root.allUserQueues; we want to be able to create a new queue called root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted by this user to root.allUserQueues will run in this newly created root.allUserQueues.user1. This is very similar to the 'user-as-default' feature in Fair Scheduler, which creates user queues under the root queue, but we want the ability to create user queues under ANY parent queue. Why do we want this? 1. Preemption: these dynamically created user queues can preempt each other if their fair share is not met, so there is fairness among users. User queues can also preempt other non-user leaf queues if they are below their fair share. 2. Allocation to user queues: we want all the user queries (ad hoc) to consume only a fraction of the resources in the shared cluster. With this feature, we could do that by giving a fair share to the parent user queue, which is then redistributed to all the dynamically created user queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
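To illustrate the delegation described in this comment (the inner rule runs first, and the outer hierarchicalUserQueue rule then decides whether to nest a user queue under the result or fall through), here is a small standalone sketch built on an assumed Rule interface; none of these names are the actual Fair Scheduler classes, and the real feature involves more (create flags, policy parsing, etc.).
{code:title=HierarchicalUserQueueSketch.java|borderStyle=solid}
// Standalone sketch of the nested-rule idea. The Rule interface, class names, and the
// "nest a user queue under the inner rule's result" behaviour are illustrative assumptions.
public class HierarchicalUserQueueSketch {

  interface Rule {
    /** Returns a queue name, or null to fall through to the next rule in the policy. */
    String assignQueue(String requestedQueue, String user, String primaryGroup);
  }

  /** Inner rule: place apps under a queue named after the user's primary group. */
  static class PrimaryGroupRule implements Rule {
    public String assignQueue(String requestedQueue, String user, String primaryGroup) {
      return "root." + primaryGroup;
    }
  }

  /** Outer rule: run the nested rule first, then nest a per-user queue under its result. */
  static class HierarchicalUserQueueRule implements Rule {
    private final Rule nested;
    HierarchicalUserQueueRule(Rule nested) { this.nested = nested; }

    public String assignQueue(String requestedQueue, String user, String primaryGroup) {
      String parent = nested.assignQueue(requestedQueue, user, primaryGroup);
      if (parent == null) {
        return null; // nested rule did not match: skip to the next rule in the policy
      }
      return parent + "." + user; // e.g. root.engineering.user1
    }
  }

  public static void main(String[] args) {
    Rule rule = new HierarchicalUserQueueRule(new PrimaryGroupRule());
    System.out.println(rule.assignQueue("default", "user1", "engineering")); // root.engineering.user1
  }
}
{code}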
[jira] [Updated] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
[ https://issues.apache.org/jira/browse/YARN-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1923: Attachment: YARN-1923.002.patch Addressed feedback Make FairScheduler resource ratio calculations terminate faster --- Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-1923.002.patch, YARN-1923.patch In fair scheduler computing shares continues till iterations are complete even when we have a perfect match between the resource shares and total resources. This is because the binary search checks only less or greater and not equals. Add an early termination condition when its equal -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966029#comment-13966029 ] Hadoop QA commented on YARN-1924: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639665/YARN-1924.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3551//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3551//console This message is automatically generated. RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch Noticed on a HA cluster Both RM shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1903: -- Attachment: YARN-1903.2.patch Thanks for review! Patch is updated. Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set --- Key: YARN-1903 URL: https://issues.apache.org/jira/browse/YARN-1903 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1903.1.patch, YARN-1903.2.patch The container status after stopping container is not expected. {code} java.lang.AssertionError: 4: at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966055#comment-13966055 ] Zhijie Shen commented on YARN-1924: --- The change should fix the bug, and the patch looks almost good to me. Two nits: 1. As the log message has been changed in RMStateStore, it would be good to say storing or updating in the corresponding methods in the FS and Memory impls. 2. One typo, and should it be an error-level log?
{code}
+    LOG.info("Error while doing ZK operaion.", ke);
{code}
RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch Noticed on an HA cluster. Both RMs shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
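Read literally, that nit points at something like the following corrected statement (error level, typo fixed), shown here in a self-contained, purely hypothetical snippet that assumes a commons-logging style Log; it is not the actual patched code.
{code:title=ZkErrorLoggingSketch.java|borderStyle=solid}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hypothetical corrected form of the quoted statement, per the review nit above.
public class ZkErrorLoggingSketch {
  private static final Log LOG = LogFactory.getLog(ZkErrorLoggingSketch.class);

  void onZkFailure(Exception ke) {
    // error level instead of info, and the "operaion" typo fixed
    LOG.error("Error while doing ZK operation.", ke);
  }
}
{code}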
[jira] [Updated] (YARN-1924) RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED
[ https://issues.apache.org/jira/browse/YARN-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1924: -- Attachment: YARN-1924.2.patch Updated the patch accordingly. Thanks for Zhijie and Karthik for taking a look. RM shut down with RMFatalEvent of type STATE_STORE_OP_FAILED Key: YARN-1924 URL: https://issues.apache.org/jira/browse/YARN-1924 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1924.1.patch, YARN-1924.2.patch Noticed on a HA cluster Both RM shut down with this error. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966081#comment-13966081 ] Jian He commented on YARN-1903: --- +1, will commit once Jenkins returns. Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set --- Key: YARN-1903 URL: https://issues.apache.org/jira/browse/YARN-1903 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1903.1.patch, YARN-1903.2.patch The container status after stopping container is not expected. {code} java.lang.AssertionError: 4: at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966083#comment-13966083 ] Hadoop QA commented on YARN-1903: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639694/YARN-1903.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3553//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3553//console This message is automatically generated. Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set --- Key: YARN-1903 URL: https://issues.apache.org/jira/browse/YARN-1903 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1903.1.patch, YARN-1903.2.patch The container status after stopping container is not expected. {code} java.lang.AssertionError: 4: at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
[ https://issues.apache.org/jira/browse/YARN-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966088#comment-13966088 ] Hadoop QA commented on YARN-1923: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639691/YARN-1923.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3552//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3552//console This message is automatically generated. Make FairScheduler resource ratio calculations terminate faster --- Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-1923.002.patch, YARN-1923.patch In fair scheduler computing shares continues till iterations are complete even when we have a perfect match between the resource shares and total resources. This is because the binary search checks only less or greater and not equals. Add an early termination condition when its equal -- This message was sent by Atlassian JIRA (v6.2#6252)