[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129743#comment-14129743 ] Zhijie Shen commented on YARN-2527: --- bq. But I believe, the NullPointer issue in ApplicationACLsManager should be fixed regardless of that. My concern is that each submitted app should have a map of acls in the acls manager, even if it is empty. If we make this change, we may hide some potential bugs. I looked into RMAppManager.createAndPopulateNewRMApp: app is put into RMContext first and then its acls is put into ApplicationACLsManager. In AppBlock, app is obtained from RMContext, and then its acls is going to be checked. If this happens after app is put into RMContext, but its acls isn't put into ApplicationACLsManager yet, user should hit the NPE case. Would you please see whether it is the race condition that you ran into before? NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
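As context for the discussion above, here is a minimal sketch of the kind of defensive check being debated, i.e. what a checkAccess-style method could do when an application's ACL map has not been registered yet (or has already been removed). All class, field and method names below are invented simplifications; this is not the actual ApplicationACLsManager API, nor the attached patch.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified ACL manager illustrating the race described above: the app is
// visible in RMContext before its ACLs are registered here, so checkAccess()
// can observe a missing entry and must not dereference it blindly.
public class SimpleAppAclsManager {

    private final Map<String, Map<String, String>> appAcls = new ConcurrentHashMap<>();

    public void addApplication(String appId, Map<String, String> acls) {
        appAcls.put(appId, acls);
    }

    public boolean checkAccess(String callerUser, String appOwner, String appId) {
        Map<String, String> acls = appAcls.get(appId);
        if (acls == null) {
            // ACLs not registered yet (or already removed): fall back to an
            // owner-only decision instead of throwing a NullPointerException.
            return callerUser.equals(appOwner);
        }
        String viewAcl = acls.get("VIEW_APP");
        return callerUser.equals(appOwner)
            || (viewAcl != null && (viewAcl.equals("*") || viewAcl.contains(callerUser)));
    }
}
{code}
Whether hiding the missing entry this way is desirable is exactly the concern raised in the comment: it can also hide the ordering bug between RMContext and the ACLs manager.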
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129755#comment-14129755 ] Remus Rusanu commented on YARN-1063: I have tested against today's trunk and the patch is identical, no need for any rebase on this. Winutils needs ability to create task as domain user Key: YARN-1063 URL: https://issues.apache.org/jira/browse/YARN-1063 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Environment: Windows Reporter: Kyle Leckie Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch h1. Summary: Securing a Hadoop cluster requires constructing some form of security boundary around the processes executed in YARN containers. Isolation based on Windows users seems most feasible. This approach is similar to the approach taken by the existing LinuxContainerExecutor. The current patch to winutils.exe adds the ability to create a process as a domain user. h1. Alternative Methods considered: h2. Process rights limited by security token restriction: On Windows, access decisions are made by examining the security token of a process. It is possible to spawn a process with a restricted security token. Any of the rights granted by SIDs of the default token may be restricted. It is possible to see this in action by examining the security token of a sandboxed process launched by a web browser. Typically the launched process will have a fully restricted token and needs to access machine resources through a dedicated broker process that enforces a custom security policy. This broker process mechanism would break compatibility with the typical Hadoop container process. The container process must be able to use standard function calls for disk and network IO. I performed some work looking at ways to ACL the local files to the specific launched user without granting rights to other processes launched on the same machine, but found this to be an overly complex solution. h2. Relying on APP containers: Recent versions of Windows have the ability to launch processes within an isolated container. Application containers are supported for execution of WinRT-based executables. This method was ruled out due to the lack of official support for standard Windows APIs. At some point in the future Windows may support functionality similar to BSD jails or Linux containers; at that point support for containers should be added. h1. Create As User Feature Description: h2. Usage: A new sub-command was added to the set of task commands. Here is the syntax: winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] Some notes: * The username specified is in the format of user@domain * The machine executing this command must be joined to the domain of the user specified * The domain controller must allow the account executing the command access to the user information. For this, join the account to the predefined group labeled Pre-Windows 2000 Compatible Access * The account running the command must have several rights on the local machine. These can be managed manually using secpol.msc: ** Act as part of the operating system - SE_TCB_NAME ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME * The launched process will not have rights to the desktop so will not be able to display any information or create UI. 
* The launched process will have no network credentials. Any access to network resources that requires domain authentication will fail. h2. Implementation: Winutils performs the following steps: # Enable the required privileges for the current process. # Register as a trusted process with the Local Security Authority (LSA). # Create a new logon for the user passed on the command line. # Load/create a profile on the local machine for the new logon. # Create a new environment for the new logon. # Launch the new process in a job with the task name specified and using the created logon. # Wait for the job to exit. h2. Future work: The following work was scoped out of this check-in: * Support for non-domain users or machines that are not domain joined. * Support for privilege isolation by running the task launcher in a high-privilege service with access over an ACLed named pipe. 
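To make the createAsUser syntax above concrete, here is a hedged example of invoking it from Java. The winutils path, task name, user and command below are placeholders, and in a real cluster this call is made by the container executor rather than by user code.
{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustrative only: runs a command through the documented
// "winutils task createAsUser" sub-command described above.
public class CreateAsUserExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        String winutils = "C:\\hadoop\\bin\\winutils.exe";   // assumed install location
        List<String> cmd = Arrays.asList(
            winutils, "task", "createAsUser",
            "container_0001",                 // TASKNAME
            "jobuser@EXAMPLE.COM",            // USERNAME in user@domain form
            "cmd /c echo hello");             // COMMAND_LINE to run as that user
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(p.waitFor());             // propagate the task's exit code
    }
}
{code}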
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129761#comment-14129761 ] Hadoop QA commented on YARN-611: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667941/YARN-611.9.rebase.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4884//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4884//console This message is automatically generated. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch, YARN-611.9.rebase.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. 
I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that determines when an AM is considered well behaved and when it's safe to reset its failure count back to zero. Every time an AM fails, the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is greater than max-retries, then the job should fail. If the AM has never failed, the retry count is less than max-retries, or the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing misbehaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of
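A rough sketch of the windowed retry logic proposed above, in plain Java; the class and field names are invented for illustration and this is not the RMAppImpl change itself.
{code}
// Tracks AM failures and applies the proposed retry-count-window-ms rule.
public class AmRetryWindow {
    private final int maxRetries;
    private final long windowMs;
    private int failureCount = 0;
    private long lastFailureTime = -1;

    public AmRetryWindow(int maxRetries, long windowMs) {
        this.maxRetries = maxRetries;
        this.windowMs = windowMs;
    }

    /** Returns true if the whole job should fail, false if the AM may be retried. */
    public synchronized boolean onAmFailure(long now) {
        if (lastFailureTime >= 0 && now - lastFailureTime > windowMs) {
            // The last failure is outside the window: the AM has been well
            // behaved, so forget the old failures before counting this one.
            failureCount = 0;
        }
        failureCount++;
        lastFailureTime = now;
        return failureCount > maxRetries;
    }
}
{code}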
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129799#comment-14129799 ] Guo Ruijing commented on YARN-1912: --- We may add a configurable parameter for the options in yarn-default.xml, such as:
<property>
  <description>When the nodemanager tries to localize the resources for Linux containers, it will use the following JVM opts to launch the Linux container localizer.</description>
  <name>yarn.nodemanager.linux-container-localizer.opts</name>
  <value>-Xmx1024m</value>
</property>
ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Attachments: YARN-1912-0.patch, YARN-1912-1.patch In LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configuration in the command; this causes the ResourceLocalizer to be started with the default memory setting. On server-class hardware, it will use 25% of the system memory as the max heap size, which can cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
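For readers skimming the patch discussion, a sketch of how such an option could be spliced into the localizer launch command; this is not the actual LinuxContainerExecutor#startLocalizer code, and the default value below simply mirrors the proposal above.
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

// Builds a java command line for the container localizer, honoring a
// configurable JVM-opts property as proposed in the comment above.
public class LocalizerCommandSketch {
    public static List<String> buildLocalizerCommand(Configuration conf, String mainClass) {
        String opts = conf.get(
            "yarn.nodemanager.linux-container-localizer.opts", "-Xmx1024m");
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        for (String opt : opts.trim().split("\\s+")) {
            cmd.add(opt);        // e.g. -Xmx1024m picked up from yarn-site.xml
        }
        cmd.add(mainClass);      // the ContainerLocalizer entry point
        return cmd;
    }
}
{code}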
[jira] [Updated] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-1972: --- Attachment: YARN-1972.delta.4.patch Patch delta.4 is a delta from YARN-1063, rebased to today's trunk Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user, as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to the LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach on the WCE came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container, the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work, the nodemanager must run as a service principal that is a member of the local Administrators group, or as LocalSystem. This is derived from the need to invoke the LoadUserProfile API, which mentions these requirements in its specification. This is in addition to the SE_TCB privilege mentioned in YARN-1063, but that requirement automatically implies that the SE_TCB privilege is held by the nodemanager. For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2.
Dedicated high privilege Service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize and communicate with this service via an IPC mechanism and use this service to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service on the project is not trivial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
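As a quick reference for the two settings named in the Deployment Requirements above, here they are expressed programmatically; in a real deployment they would simply be two property entries in yarn-site.xml, and the group name used below is a placeholder.
{code}
import org.apache.hadoop.conf.Configuration;

// Minimal illustration of the WCE-related yarn-site.xml settings.
public class WceConfigExample {
    public static Configuration wceConf() {
        Configuration conf = new Configuration();
        conf.set("yarn.nodemanager.container-executor.class",
            "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
        // Windows security group that the NM service principal belongs to
        // ("nodemanagergrp" is a placeholder name).
        conf.set("yarn.nodemanager.windows-secure-container-executor.group",
            "nodemanagergrp");
        return conf;
    }
}
{code}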
[jira] [Updated] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-1972: --- Attachment: YARN-1972.trunk.4.patch Patch trunk.4 is a diff from trunk, for Jenkins to evaluate Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.trunk.4.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user, as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to the LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach on the WCE came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container, the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work, the nodemanager must run as a service principal that is a member of the local Administrators group, or as LocalSystem. This is derived from the need to invoke the LoadUserProfile API, which mentions these requirements in its specification. This is in addition to the SE_TCB privilege mentioned in YARN-1063, but that requirement automatically implies that the SE_TCB privilege is held by the nodemanager. For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2.
Dedicated high privilege Service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize and communicate with this service via an IPC mechanism and use this service to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service on the project is not trivial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129812#comment-14129812 ] Hadoop QA commented on YARN-1912: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642478/YARN-1912-1.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4885//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4885//console This message is automatically generated. ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Attachments: YARN-1912-0.patch, YARN-1912-1.patch In the LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configurations in the command, this caused the ResourceLocalizer to be started with default memory setting. In an server-level hardware, it will use 25% of the system memory as the max heap size, this will cause memory issue in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129837#comment-14129837 ] Wangda Tan commented on YARN-2496: -- Hi [~jianhe], Thanks for your comments, bq. CSQueueUtils.java format change only, we can revert Reverted bq. why checking labelManager != null every where ? we only need to check where it’s needed. It was used to reduce changes in tests. I think we should remove these checks and improve tests bq. We may not need to change the method signature to add one more parameter, just pass the queues map into NodeLabelManager#reinitializeQueueLabels, to avoid a number of test changes. Make sense, now reverted changes for related tests and get a queueToLabels after parseQueue bq. label initialization code is duplicated between ParentQueue and LeafQueue, how about creating an AbstractCSQueue and put common initialization methods there ? Make sense, there do have lots of common code between PQ and LQ, now I have merged all common parts to abstractCSQueue. Attached a new patch Thanks, Wangda [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
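To make the queue-side options described above a bit more tangible, here is a hedged example of what such a configuration could look like; the exact property names were still being settled in this patch, so the two label-related keys below are hypothetical stand-ins written in the style of existing capacity-scheduler.xml keys.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative capacity-scheduler settings: a queue with accessible labels
// and a default label expression, alongside the usual capacity options.
public class QueueLabelConfigSketch {
    public static Configuration exampleQueueLabels() {
        Configuration conf = new Configuration();
        conf.set("yarn.scheduler.capacity.root.queues", "a,b");
        conf.set("yarn.scheduler.capacity.root.a.capacity", "50");
        // Hypothetical keys for the options discussed in this JIRA:
        conf.set("yarn.scheduler.capacity.root.a.labels", "gpu,large-mem");
        conf.set("yarn.scheduler.capacity.root.a.default-label-expression", "gpu");
        return conf;
    }
}
{code}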
[jira] [Updated] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2496: - Attachment: YARN-2496.patch [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129845#comment-14129845 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667992/YARN-2496.patch against trunk revision 4be9517. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4887//console This message is automatically generated. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129854#comment-14129854 ] Hadoop QA commented on YARN-1972: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667974/YARN-1972.trunk.4.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.yarn.server.nodemanager.TestLinuxContainerExecutorWithMocks {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4886//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4886//console This message is automatically generated. Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.trunk.4.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user, as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to the LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach on the WCE came from practical trial and error.
I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container, the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work, the nodemanager must run as a service principal that is a member of the local Administrators
[jira] [Updated] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2496: - Attachment: YARN-2496.patch [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2500: - Attachment: YARN-2500.patch [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129875#comment-14129875 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668003/YARN-2496.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4888//console This message is automatically generated. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129877#comment-14129877 ] Hadoop QA commented on YARN-2500: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668004/YARN-2500.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4889//console This message is automatically generated. [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129904#comment-14129904 ] Hudson commented on YARN-1458: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #677 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/677/]) YARN-1458. FairScheduler: Zero weight can lead to livelock. (Zhihai Xu via kasha) (kasha: rev 3072c83b38fd87318d502a7d1bc518963b5ccdf7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java * hadoop-yarn-project/CHANGES.txt FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
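For readers trying to connect the stack trace above to the "zero weight" title, here is a deliberately simplified model of the fair-share computation; the real code lives in ComputeFairShares, and the class and method names below are illustrative only. With every weight equal to zero, the ratio search can never make progress, so the update thread spins while holding the scheduler lock and the event processor blocks behind it, which is the livelock being fixed.
{code}
import java.util.List;

// Toy version of the weight-to-resource-ratio search used for fair shares.
public class FairShareSketch {

    static int resourceUsedWithRatio(double ratio, List<Double> weights) {
        int total = 0;
        for (double w : weights) {
            total += (int) (w * ratio);   // a zero weight contributes nothing at any ratio
        }
        return total;
    }

    static double findRatio(int totalResource, List<Double> weights) {
        double sum = 0;
        for (double w : weights) {
            sum += w;
        }
        if (sum <= 0) {
            return 0; // without a guard like this the loop below never terminates
        }
        double ratio = 1.0;
        while (resourceUsedWithRatio(ratio, weights) < totalResource) {
            ratio *= 2.0;                  // grow until the shares cover the resource
        }
        return ratio;
    }
}
{code}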
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129908#comment-14129908 ] Hudson commented on YARN-415: - SUCCESS: Integrated in Hadoop-Yarn-trunk #677 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/677/]) YARN-415. Capture aggregate memory allocation at the app-level for chargeback. Contributed by Eric Payne Andrey Klochkov (jianhe: rev 83be3ad44484bf8a24cb90de4b9c26ab59d226a8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto/yarn_server_resourcemanager_recovery.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationResourceUsageReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestContainerResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/AggregateAppResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppBlock.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestClientRMService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebAppFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/BuilderUtils.java *
[jira] [Commented] (YARN-2448) RM should expose the resource types considered during scheduling when AMs register
[ https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129911#comment-14129911 ] Hudson commented on YARN-2448: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #677 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/677/]) YARN-2448. Changed ApplicationMasterProtocol to expose RM-recognized resource types to the AMs. Contributed by Varun Vasudev. (vinodkv: rev b67d5ba7842cc10695d987f217027848a5a8c3d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/RegisterApplicationMasterResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/RegisterApplicationMasterResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java RM should expose the resource types considered during scheduling when AMs register -- Key: YARN-2448 URL: https://issues.apache.org/jira/browse/YARN-2448 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch, apache-yarn-2448.2.patch The RM should expose the name of the ResourceCalculator being used when AMs register, as part of the RegisterApplicationMasterResponse. This will allow applications to make better decisions when scheduling. MapReduce for example, only looks at memory when deciding it's scheduling, even though the RM could potentially be using the DominantResourceCalculator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
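A short sketch of how an AM might consume the new registration information; the accessor and enum names below reflect my reading of the committed patch, but treat them as assumptions rather than a definitive API reference.
{code}
import java.util.EnumSet;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.proto.YarnServiceProtos.SchedulerResourceTypes;

// Inspect which resource types the RM scheduler actually considers.
public class AmRegistrationExample {
    static void inspect(RegisterApplicationMasterResponse response) {
        EnumSet<SchedulerResourceTypes> types = response.getSchedulerResourceTypes();
        if (types.contains(SchedulerResourceTypes.CPU)) {
            // DominantResourceCalculator-style scheduling: vcores matter too,
            // so the AM should size CPU requests as carefully as memory.
            System.out.println("Scheduler accounts for CPU as well: " + types);
        } else {
            System.out.println("Scheduler accounts for memory only: " + types);
        }
    }
}
{code}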
[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129907#comment-14129907 ] Hudson commented on YARN-2158: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #677 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/677/]) YARN-2158. Fixed TestRMWebServicesAppsModification#testSingleAppKill test failure. Contributed by Varun Vasudev (jianhe: rev cbfe26370b85161c79fdd48bf69c95d5725d8f6a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * hadoop-yarn-project/CHANGES.txt TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor Fix For: 2.6.0 Attachments: apache-yarn-2158.0.patch, apache-yarn-2158.1.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129906#comment-14129906 ] Hudson commented on YARN-2440: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #677 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/677/]) YARN-2440. Enabled Nodemanagers to limit the aggregate cpu usage across all containers to a preconfigured limit. Contributed by Varun Vasudev. (vinodkv: rev 4be95175cdb58ff12a27ab443d609d3b46da7bfa) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129910#comment-14129910 ] Hudson commented on YARN-2459: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #677 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/677/]) YARN-2459. RM crashes if App gets rejected for any reason and HA is enabled. Contributed by Jian He (xgong: rev 47bdfa044aa1d587b24edae8b1b0c796d829c960) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java Fix CHANGES.txt. Credit Mayank Bansal for his contributions on YARN-2459 (xgong: rev 7d38ffc8d3500d428bdad5640e9e70d66ed5ea60) * hadoop-yarn-project/CHANGES.txt RM crashes if App gets rejected for any reason and HA is enabled Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.6.0 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch If RM HA is enabled and the ZooKeeper store is used for the RM state store: if, for any reason, an app gets rejected and goes directly from NEW to FAILED, the final transition adds it to the RMApps and completed-apps in-memory structures, but it never makes it to the state store. Now, when the RMApps default limit is reached, the RM starts deleting apps from memory and from the store. In that case it tries to delete this app from the store and fails, which causes the RM to crash. Thanks, Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits
[ https://issues.apache.org/jira/browse/YARN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2531: Attachment: apache-yarn-2531.0.patch Uploaded patch with added config. CGroups - Admins should be allowed to enforce strict cpu limits --- Key: YARN-2531 URL: https://issues.apache.org/jira/browse/YARN-2531 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2531.0.patch From YARN-2440 - {quote} The other dimension to this is determinism w.r.t performance. Limiting to allocated cores overall (as well as per container later) helps orgs run workloads and reason about them deterministically. One of the examples is benchmarking apps, but deterministic execution is a desired option beyond benchmarks too. {quote} It would be nice to have an option to let admins to enforce strict cpu limits for apps for things like benchmarking, etc. By default this flag should be off so that containers can use available cpu but admin can turn the flag on to determine worst case performance, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
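To illustrate what a "strict" limit means in cgroups terms, here is a back-of-the-envelope sketch: instead of only weighting containers via cpu.shares, a strict mode would also cap cpu.cfs_quota_us so a container can never exceed its allocated fraction of the node. The constants and arithmetic below are illustrative and are not taken from the attached patch.
{code}
// Computes a CFS quota for one container cgroup under a strict CPU limit.
public class StrictCpuLimitSketch {
    static final int CFS_PERIOD_US = 1000 * 1000; // 1 second scheduling period

    /**
     * @param containerVCores vcores allocated to the container
     * @param nodeVCores      vcores configured for the whole NodeManager
     * @param physicalCores   physical cores visible to the OS
     */
    static int cfsQuotaUs(int containerVCores, int nodeVCores, int physicalCores) {
        // Fraction of the node this container owns, converted into CPU time
        // per period across the machine's physical cores.
        double fraction = (double) containerVCores / nodeVCores;
        return (int) (CFS_PERIOD_US * physicalCores * fraction);
    }

    public static void main(String[] args) {
        // 2 of 8 vcores on a 4-core box -> 1,000,000us of CPU per 1,000,000us period,
        // i.e. at most one full core, regardless of how idle the node is.
        System.out.println(cfsQuotaUs(2, 8, 4));
    }
}
{code}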
[jira] [Commented] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits
[ https://issues.apache.org/jira/browse/YARN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129974#comment-14129974 ] Hadoop QA commented on YARN-2531: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668046/apache-yarn-2531.0.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4890//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4890//console This message is automatically generated. CGroups - Admins should be allowed to enforce strict cpu limits --- Key: YARN-2531 URL: https://issues.apache.org/jira/browse/YARN-2531 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2531.0.patch From YARN-2440 - {quote} The other dimension to this is determinism w.r.t performance. Limiting to allocated cores overall (as well as per container later) helps orgs run workloads and reason about them deterministically. One of the examples is benchmarking apps, but deterministic execution is a desired option beyond benchmarks too. {quote} It would be nice to have an option to let admins to enforce strict cpu limits for apps for things like benchmarking, etc. By default this flag should be off so that containers can use available cpu but admin can turn the flag on to determine worst case performance, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.delta.4.patch delta.4.patch is based on YARN-1972. Is rebased to current trunk and integrates YARN-2458 as well. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.separation.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires a the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
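To make the NM / privileged NT service split described above more concrete, here is a purely hypothetical sketch of what the NodeManager-side JNI surface over the libwinutils LPC client could look like; the real interface is whatever the patch defines, and every name and signature below is illustrative.
{code}
// Purely illustrative sketch of an NM-side JNI binding to the LPC client hosted in
// libwinutils. Names, signatures, and the library name are hypothetical.
public final class WinutilsLpcClient {
  static {
    System.loadLibrary("winutils"); // hypothetical JNI library name
  }

  // Connects to the privileged NT service's LPC port (NtConnectPort) and sends a
  // container-launch request (NtRequestWaitReplyPort); the service performs the
  // privileged createAsUser work and replies with an exit status.
  public static native int launchContainerAsUser(String user,
                                                 String taskName,
                                                 String commandLine);
}
{code}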
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.trunk.4.patch trunk.4.patch is same as delta.4.patch but is diff from trunk Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires a the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129983#comment-14129983 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668056/YARN-2198.trunk.4.patch against trunk revision 4be9517. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4891//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires a the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130032#comment-14130032 ] Hudson commented on YARN-2459: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1893 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1893/]) YARN-2459. RM crashes if App gets rejected for any reason and HA is enabled. Contributed by Jian He (xgong: rev 47bdfa044aa1d587b24edae8b1b0c796d829c960) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java Fix CHANGES.txt. Credit Mayank Bansal for his contributions on YARN-2459 (xgong: rev 7d38ffc8d3500d428bdad5640e9e70d66ed5ea60) * hadoop-yarn-project/CHANGES.txt RM crashes if App gets rejected for any reason and HA is enabled Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.6.0 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch If RM HA is enabled and used Zookeeper store for RM State Store. If for any reason Any app gets rejected and directly goes to NEW to FAILED then final transition makes that to RMApps and Completed Apps memory structure but that doesn't make it to State store. Now when RMApps default limit reaches it starts deleting apps from memory and store. In that case it try to delete this app from store and fails which causes RM to crash. Thanks, Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130026#comment-14130026 ] Hudson commented on YARN-1458: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1893 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1893/]) YARN-1458. FairScheduler: Zero weight can lead to livelock. (Zhihai Xu via kasha) (kasha: rev 3072c83b38fd87318d502a7d1bc518963b5ccdf7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/CHANGES.txt FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
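The livelock reported here comes from the fair-share computation: ComputeFairShares searches for a weight-to-resource ratio whose resulting shares add up to the available resource, and with all-zero weights the computed total never grows, so the search cannot make progress while the scheduler lock is held. Below is a simplified, self-contained sketch of that failure mode (names are illustrative, and an artificial cap is added so the demo terminates).
{code}
// Simplified illustration of the zero-weight livelock. With every weight equal to
// zero, the computed total share stays at zero, so the ratio-doubling loop in the
// real scheduler never reaches the target and spins forever under the scheduler lock.
public class ZeroWeightLivelockDemo {
  static int resourceUsedWithRatio(double[] weights, double ratio) {
    int total = 0;
    for (double w : weights) {
      total += (int) (w * ratio); // share computed for one schedulable
    }
    return total;
  }

  public static void main(String[] args) {
    double[] weights = {0.0, 0.0, 0.0}; // all queues/apps have zero weight
    int totalResource = 8192;
    double ratio = 1.0;
    int iterations = 0;
    // Mirrors the "keep doubling until the shares overshoot" step; capped here so
    // the demo stops, whereas the real loop has no such cap.
    while (resourceUsedWithRatio(weights, ratio) < totalResource && iterations < 50) {
      ratio *= 2;
      iterations++;
    }
    System.out.println("Stopped after " + iterations
        + " doublings; computed share total is still "
        + resourceUsedWithRatio(weights, ratio));
  }
}
{code}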
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130028#comment-14130028 ] Hudson commented on YARN-2440: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1893 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1893/]) YARN-2440. Enabled Nodemanagers to limit the aggregate cpu usage across all containers to a preconfigured limit. Contributed by Varun Vasudev. (vinodkv: rev 4be95175cdb58ff12a27ab443d609d3b46da7bfa) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
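The change above caps the aggregate CPU of all containers using cgroups CFS bandwidth control. As a hedged sketch of the arithmetic involved (the constants and property name are illustrative; the real CgroupsLCEResourcesHandler also has to respect kernel limits on these values):
{code}
// Rough sketch of deriving a CFS period/quota pair from a configured percentage of
// the node's physical CPU. Not the patch's exact code.
public class CpuCapSketch {
  public static void main(String[] args) {
    int numCores = Runtime.getRuntime().availableProcessors();
    int percentageLimit = 80;     // e.g. yarn.nodemanager.resource.percentage-physical-cpu-limit
    int periodUs = 1000 * 1000;   // cpu.cfs_period_us: one-second scheduling period

    // Quota is the CPU time all YARN containers may consume per period:
    // 80% of an 8-core node => 6.4 cores => 6,400,000 us of CPU time per second.
    long quotaUs = (long) (periodUs * numCores * (percentageLimit / 100.0));

    System.out.println("cpu.cfs_period_us = " + periodUs);
    System.out.println("cpu.cfs_quota_us  = " + quotaUs);
  }
}
{code}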
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130030#comment-14130030 ] Hudson commented on YARN-415: - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1893 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1893/]) YARN-415. Capture aggregate memory allocation at the app-level for chargeback. Contributed by Eric Payne Andrey Klochkov (jianhe: rev 83be3ad44484bf8a24cb90de4b9c26ab59d226a8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/ApplicationAttemptStateData.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto/yarn_server_resourcemanager_recovery.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestContainerResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestClientRMService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationResourceUsageReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationResourceUsageReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/impl/pb/ApplicationAttemptStateDataPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/BuilderUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/AggregateAppResourceUsage.java *
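The chargeback metric captured by the YARN-415 change above is essentially resource-time: memory-seconds and vcore-seconds accumulated while an application's containers run. A small illustration of the arithmetic follows; the variable names are hypothetical and not the patch's API.
{code}
// Illustrative arithmetic for per-application chargeback accounting:
// MB-seconds and vcore-seconds for one finished container.
public class MemorySecondsSketch {
  public static void main(String[] args) {
    long memoryMb = 2048;         // container memory allocation
    long vcores = 2;              // container vcore allocation
    long startMs = 1_000_000L;
    long finishMs = 1_360_000L;   // container ran for 360 seconds

    long usedSeconds = (finishMs - startMs) / 1000;
    long memorySeconds = memoryMb * usedSeconds;  // 2048 MB * 360 s = 737,280 MB-s
    long vcoreSeconds = vcores * usedSeconds;     // 2 vcores * 360 s = 720 vcore-s

    System.out.println("memorySeconds=" + memorySeconds
        + ", vcoreSeconds=" + vcoreSeconds);
  }
}
{code}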
[jira] [Commented] (YARN-2448) RM should expose the resource types considered during scheduling when AMs register
[ https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130033#comment-14130033 ] Hudson commented on YARN-2448: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1893 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1893/]) YARN-2448. Changed ApplicationMasterProtocol to expose RM-recognized resource types to the AMs. Contributed by Varun Vasudev. (vinodkv: rev b67d5ba7842cc10695d987f217027848a5a8c3d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/RegisterApplicationMasterResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/RegisterApplicationMasterResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto RM should expose the resource types considered during scheduling when AMs register -- Key: YARN-2448 URL: https://issues.apache.org/jira/browse/YARN-2448 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch, apache-yarn-2448.2.patch The RM should expose the name of the ResourceCalculator being used when AMs register, as part of the RegisterApplicationMasterResponse. This will allow applications to make better decisions when scheduling. MapReduce for example, only looks at memory when deciding it's scheduling, even though the RM could potentially be using the DominantResourceCalculator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
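Below is a sketch of how an ApplicationMaster might act on this information at registration time. It assumes the response exposes the scheduler's resource types through an accessor along the lines of getSchedulerResourceTypes(); the enum in the sketch is a stand-in for whatever the committed patch defines.
{code}
// Sketch of an AM adjusting its resource requests based on which resource types the
// scheduler actually accounts for. The enum is a stand-in, not the real API.
import java.util.EnumSet;

public class AmRegistrationSketch {
  enum SchedulerResourceTypes { MEMORY, CPU }   // stand-in for the patch's enum

  static void planRequests(EnumSet<SchedulerResourceTypes> types) {
    if (types.contains(SchedulerResourceTypes.CPU)) {
      // DominantResourceCalculator (or similar): size requests on vcores as well.
      System.out.println("Scheduler accounts for CPU; request meaningful vcores.");
    } else {
      // DefaultResourceCalculator: only memory matters for scheduling decisions.
      System.out.println("Scheduler is memory-only; vcore values are ignored.");
    }
  }

  public static void main(String[] args) {
    planRequests(EnumSet.of(SchedulerResourceTypes.MEMORY, SchedulerResourceTypes.CPU));
  }
}
{code}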
[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130029#comment-14130029 ] Hudson commented on YARN-2158: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1893 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1893/]) YARN-2158. Fixed TestRMWebServicesAppsModification#testSingleAppKill test failure. Contributed by Varun Vasudev (jianhe: rev cbfe26370b85161c79fdd48bf69c95d5725d8f6a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor Fix For: 2.6.0 Attachments: apache-yarn-2158.0.patch, apache-yarn-2158.1.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
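For reference, one common way to de-flake a check like testSingleAppKill is to poll for the expected application state for a bounded time rather than asserting on the first response, since the kill is processed asynchronously by the RM. The following is only a sketch of that pattern, not necessarily what the committed fix does.
{code}
// Generic "wait for state" helper: retries the state check until it matches or a
// timeout expires, instead of asserting on a single snapshot.
public class WaitForStateSketch {
  interface StateReader { String currentState() throws Exception; }

  static boolean waitForState(StateReader reader, String expected, long timeoutMs)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (expected.equals(reader.currentState())) {
        return true;
      }
      Thread.sleep(100);   // small poll interval between checks
    }
    return expected.equals(reader.currentState()); // final check at the deadline
  }
}
{code}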
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2033: - Attachment: YARN-2033.13.patch The .12 patch no longer applies due to conflicts with the latest changes on trunk. Rebased the patch against the latest trunk and made slight updates in .13. Will commit it pending the Jenkins test. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, YARN-2033.12.patch, YARN-2033.13.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights into what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and keep most of the client-side interfaces as close as possible to what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130049#comment-14130049 ] Hudson commented on YARN-1458: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1868/]) YARN-1458. FairScheduler: Zero weight can lead to livelock. (Zhihai Xu via kasha) (kasha: rev 3072c83b38fd87318d502a7d1bc518963b5ccdf7) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130052#comment-14130052 ] Hudson commented on YARN-2158: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1868/]) YARN-2158. Fixed TestRMWebServicesAppsModification#testSingleAppKill test failure. Contributed by Varun Vasudev (jianhe: rev cbfe26370b85161c79fdd48bf69c95d5725d8f6a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * hadoop-yarn-project/CHANGES.txt TestRMWebServicesAppsModification sometimes fails in trunk -- Key: YARN-2158 URL: https://issues.apache.org/jira/browse/YARN-2158 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Varun Vasudev Priority: Minor Fix For: 2.6.0 Attachments: apache-yarn-2158.0.patch, apache-yarn-2158.1.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : {code} Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.297 sec FAILURE! java.lang.AssertionError: app state incorrect at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2448) RM should expose the resource types considered during scheduling when AMs register
[ https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130056#comment-14130056 ] Hudson commented on YARN-2448: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1868/]) YARN-2448. Changed ApplicationMasterProtocol to expose RM-recognized resource types to the AMs. Contributed by Varun Vasudev. (vinodkv: rev b67d5ba7842cc10695d987f217027848a5a8c3d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/RegisterApplicationMasterResponse.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/RegisterApplicationMasterResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto RM should expose the resource types considered during scheduling when AMs register -- Key: YARN-2448 URL: https://issues.apache.org/jira/browse/YARN-2448 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch, apache-yarn-2448.2.patch The RM should expose the name of the ResourceCalculator being used when AMs register, as part of the RegisterApplicationMasterResponse. This will allow applications to make better decisions when scheduling. MapReduce for example, only looks at memory when deciding it's scheduling, even though the RM could potentially be using the DominantResourceCalculator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130055#comment-14130055 ] Hudson commented on YARN-2459: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1868/]) YARN-2459. RM crashes if App gets rejected for any reason and HA is enabled. Contributed by Jian He (xgong: rev 47bdfa044aa1d587b24edae8b1b0c796d829c960) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java Fix CHANGES.txt. Credit Mayank Bansal for his contributions on YARN-2459 (xgong: rev 7d38ffc8d3500d428bdad5640e9e70d66ed5ea60) * hadoop-yarn-project/CHANGES.txt RM crashes if App gets rejected for any reason and HA is enabled Key: YARN-2459 URL: https://issues.apache.org/jira/browse/YARN-2459 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.6.0 Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch If RM HA is enabled and used Zookeeper store for RM State Store. If for any reason Any app gets rejected and directly goes to NEW to FAILED then final transition makes that to RMApps and Completed Apps memory structure but that doesn't make it to State store. Now when RMApps default limit reaches it starts deleting apps from memory and store. In that case it try to delete this app from store and fails which causes RM to crash. Thanks, Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130051#comment-14130051 ] Hudson commented on YARN-2440: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1868/]) YARN-2440. Enabled Nodemanagers to limit the aggregate cpu usage across all containers to a preconfigured limit. Contributed by Varun Vasudev. (vinodkv: rev 4be95175cdb58ff12a27ab443d609d3b46da7bfa) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/NodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestNodeManagerHardwareUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130053#comment-14130053 ] Hudson commented on YARN-415: - SUCCESS: Integrated in Hadoop-Hdfs-trunk #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1868/]) YARN-415. Capture aggregate memory allocation at the app-level for chargeback. Contributed by Eric Payne Andrey Klochkov (jianhe: rev 83be3ad44484bf8a24cb90de4b9c26ab59d226a8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestContainerResourceUsage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/ApplicationAttemptStateData.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebAppFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationResourceUsageReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java *
[jira] [Created] (YARN-2535) Test JIRA, ignore.
Allen Wittenauer created YARN-2535: -- Summary: Test JIRA, ignore. Key: YARN-2535 URL: https://issues.apache.org/jira/browse/YARN-2535 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2535) Test JIRA, ignore.
[ https://issues.apache.org/jira/browse/YARN-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2535: --- Labels: releasenotes (was: ) Test JIRA, ignore. -- Key: YARN-2535 URL: https://issues.apache.org/jira/browse/YARN-2535 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Labels: releasenotes -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2535) Test JIRA, ignore.
[ https://issues.apache.org/jira/browse/YARN-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2535: --- Issue Type: Improvement (was: Bug) Test JIRA, ignore. -- Key: YARN-2535 URL: https://issues.apache.org/jira/browse/YARN-2535 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2535) Test JIRA, ignore.
[ https://issues.apache.org/jira/browse/YARN-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2535: --- Labels: (was: releasenotes) Test JIRA, ignore. -- Key: YARN-2535 URL: https://issues.apache.org/jira/browse/YARN-2535 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2535) Test JIRA, ignore.
[ https://issues.apache.org/jira/browse/YARN-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2535: --- Hadoop Flags: (was: Incompatible change) Test JIRA, ignore. -- Key: YARN-2535 URL: https://issues.apache.org/jira/browse/YARN-2535 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2536) YARN project doesn't have release notes field
Allen Wittenauer created YARN-2536: -- Summary: YARN project doesn't have release notes field Key: YARN-2536 URL: https://issues.apache.org/jira/browse/YARN-2536 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Looking through the Hadoop project JIRAs, I noticed that YARN doesn't seem to have a release notes field. I'm not sure how or why, given that, from what I can see on the JIRA administration side, it should. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Deleted] (YARN-2536) YARN project doesn't have release notes field
[ https://issues.apache.org/jira/browse/YARN-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer deleted YARN-2536: --- YARN project doesn't have release notes field - Key: YARN-2536 URL: https://issues.apache.org/jira/browse/YARN-2536 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Looking through the Hadoop project JIRAs, I noticed that YARN doesn't seem to have a release notes field. I'm not sure how or why, given that, from what I can see on the JIRA administration side, it should. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2535) Test JIRA, ignore.
[ https://issues.apache.org/jira/browse/YARN-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer reassigned YARN-2535: -- Assignee: Allen Wittenauer Test JIRA, ignore. -- Key: YARN-2535 URL: https://issues.apache.org/jira/browse/YARN-2535 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Allen Wittenauer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2537) relnotes.py prints description instead of release note for YARN issues
Allen Wittenauer created YARN-2537: -- Summary: relnotes.py prints description instead of release note for YARN issues Key: YARN-2537 URL: https://issues.apache.org/jira/browse/YARN-2537 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Currently, the release notes for YARN always print the description JIRA field instead of the release note. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2537) relnotes.py prints description instead of release note for YARN issues
[ https://issues.apache.org/jira/browse/YARN-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130164#comment-14130164 ] Allen Wittenauer commented on YARN-2537: Digging deeper into the issue, it looks like JIRA is lacking a release notes field for YARN! Opened INFRA-8338 to see if they can figure out why YARN lacks that field. relnotes.py prints description instead of release note for YARN issues -- Key: YARN-2537 URL: https://issues.apache.org/jira/browse/YARN-2537 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Currently, the release notes for YARN always print the description JIRA field instead of the release note. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2464) Provide Hadoop as a local resource (on HDFS) which can be used by other projects
[ https://issues.apache.org/jira/browse/YARN-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du reassigned YARN-2464: Assignee: Junping Du Provide Hadoop as a local resource (on HDFS) which can be used by other projects Key: YARN-2464 URL: https://issues.apache.org/jira/browse/YARN-2464 Project: Hadoop YARN Issue Type: Improvement Reporter: Siddharth Seth Assignee: Junping Du DEFAULT_YARN_APPLICATION_CLASSPATH is used by YARN projects to set up their AM / task classpaths if they have a dependency on Hadoop libraries. It'll be useful to provide similar access to a Hadoop tarball (Hadoop libs, native libraries, etc.), which could be used instead - for applications which do not want to rely upon Hadoop versions from a cluster node. This would also require functionality to update the classpath/env for the apps based on the structure of the tar. As an example, MR has support for a full tar (for rolling upgrades). Similarly, Tez ships Hadoop libraries along with its build. I'm not sure about the Spark / Storm / HBase model for this - but using a common copy instead of everyone localizing Hadoop libraries would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2104) Scheduler queue filter failed to work because index of queue column changed
[ https://issues.apache.org/jira/browse/YARN-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130268#comment-14130268 ] Ashwin Shankar commented on YARN-2104: -- Hi [~wangda], This issue still happens for FairScheduler (in trunk) but not CapacityScheduler. Can you please check? Steps to reproduce: 1. Run an app in the default queue. 2. While the app is running, go to the scheduler page on the RM UI. 3. You would see the app in the app table at the bottom. 4. Now click the default queue to filter the app table on queue name. 5. The app disappears from the app table although it is running in the default queue. Scheduler queue filter failed to work because index of queue column changed --- Key: YARN-2104 URL: https://issues.apache.org/jira/browse/YARN-2104 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: YARN-2104.patch YARN-563 added {code} + th(".type", "Application Type") {code} to the application table, which moved the queue column's index from 3 to 4. But in the scheduler page, the queue column index is hard-coded to 3 when filtering applications by queue name: {code} if (q == 'root') q = ''; else q = '^' + q.substr(q.lastIndexOf('.') + 1) + '$'; $('#apps').dataTable().fnFilter(q, 3, true); {code} So the queue filter will not work on the applications page. Reproduce steps: (Thanks Bo Yang for pointing this out) {code} 1) In the default setup, there's a default queue under the root queue 2) Run an arbitrary application; you can find it in the "Applications" page 3) Click the "Default" queue in the scheduler page 4) Click "Applications"; no application will show here 5) Click the "Root" queue in the scheduler page 6) Click "Applications"; the application will show again {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130282#comment-14130282 ] Xuan Gong commented on YARN-611: Test case (org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat) failure is unrelated. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch, YARN-611.9.rebase.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails, the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is at or above max-retries, then the job should fail. If the AM has never failed, the retry count is below max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing misbehaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
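To make the proposed window semantics concrete, here is a minimal sketch of the check described above, assuming we already have the end times of past failed attempts; the helper is hypothetical, not the actual RmAppImpl change:
{code}
import java.util.List;

public class AmRetryWindow {
  // Counts only failures that ended inside the window; failures older than
  // retry-count-window-ms are effectively "forgotten".
  static boolean shouldFailApp(List<Long> failedAttemptEndTimes,
      int maxRetries, long windowMs, long now) {
    int recentFailures = 0;
    for (long endTime : failedAttemptEndTimes) {
      if (now - endTime < windowMs) {
        recentFailures++;
      }
    }
    return recentFailures >= maxRetries;
  }
}
{code}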
[jira] [Commented] (YARN-2464) Provide Hadoop as a local resource (on HDFS) which can be used by other projects
[ https://issues.apache.org/jira/browse/YARN-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130283#comment-14130283 ] Junping Du commented on YARN-2464: -- Synced with Sid offline and I will take on this JIRA. A couple of thoughts: - We should leverage YARN's resource localizing mechanism to locate the Hadoop jar from a configurable place on HDFS. - The AM should have an explicit API to set the HDFS path of the Hadoop jar (against a specific version) for this application, as well as the classpath, just like what we do for MR in MAPREDUCE-4421. - The Hadoop jar gets loaded to the NM's local directories as a public resource and can be shared by different apps running on the same node, but different versions of the Hadoop jar should have different names. If the same name arrives with different content (detected by checking size, etc.), it should warn with an exception. - With this, apps running on top of YARN don't rely on the Hadoop jars installed on each node, which benefits the rolling upgrade feature. Provide Hadoop as a local resource (on HDFS) which can be used by other projects Key: YARN-2464 URL: https://issues.apache.org/jira/browse/YARN-2464 Project: Hadoop YARN Issue Type: Improvement Reporter: Siddharth Seth Assignee: Junping Du DEFAULT_YARN_APPLICATION_CLASSPATH is used by YARN projects to set up their AM / task classpaths if they have a dependency on Hadoop libraries. It'll be useful to provide similar access to a Hadoop tarball (Hadoop libs, native libraries) etc, which could be used instead - for applications which do not want to rely upon Hadoop versions from a cluster node. This would also require functionality to update the classpath/env for the apps based on the structure of the tar. As an example, MR has support for a full tar (for rolling upgrades). Similarly, Tez ships hadoop libraries along with its build. I'm not sure about the Spark / Storm / HBase model for this - but using a common copy instead of everyone localizing Hadoop libraries would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
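As a rough illustration of the localization idea above, a minimal sketch assuming the tarball already sits at a known HDFS path; the link name, path handling, and helper class are made up for illustration, not the eventual design:
{code}
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class HadoopTarballResource {
  // Registers a Hadoop tarball on HDFS as a PUBLIC archive so the NM localizes
  // it once and shares it across applications on the same node.
  static void addHadoopTarball(FileSystem fs, Path tarball,
      Map<String, LocalResource> localResources) throws IOException {
    FileStatus status = fs.getFileStatus(tarball);
    LocalResource res = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(tarball),
        LocalResourceType.ARCHIVE,            // unpacked on localization
        LocalResourceVisibility.PUBLIC,       // shared by apps on the node
        status.getLen(), status.getModificationTime());
    localResources.put("hadoop", res);        // illustrative link name
  }
}
{code}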
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130292#comment-14130292 ] Junping Du commented on YARN-2033: -- Seems not work. [~zjshen], can you provide a new one here? Thanks! Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, YARN-2033.12.patch, YARN-2033.13.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.13.patch Fix the conflicts against the latest trunk Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, YARN-2033.12.patch, YARN-2033.13.patch, YARN-2033.13.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130368#comment-14130368 ] Andrey Klochkov commented on YARN-415: -- [~eepayne], congratulations and thanks for this tremendous amount of persistence! :-) Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Fix For: 2.6.0 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.201409102216.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130383#comment-14130383 ] Benoy Antony commented on YARN-2527: {code} this.applicationACLsManager.addApplication(applicationId, submissionContext.getAMContainerSpec().getApplicationACLs()); {code} My guess is that {code}submissionContext.getAMContainerSpec().getApplicationACLs(){code} could be null for some Application types. (non mapreduce). NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
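One possible guard along the lines Benoy suggests (a hedged sketch, not the attached YARN-2527 patch) would fall back to an empty ACL map when the AM container spec carries none:
{code}
import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;

final class AclDefaults {
  // Returns the given ACL map, or an empty map when the submission context
  // carried no ACLs (e.g. some non-MapReduce application types).
  static Map<ApplicationAccessType, String> orEmpty(
      Map<ApplicationAccessType, String> acls) {
    return acls == null
        ? Collections.<ApplicationAccessType, String>emptyMap()
        : acls;
  }
}
{code}
The caller quoted above could then pass AclDefaults.orEmpty(submissionContext.getAMContainerSpec().getApplicationACLs()) to addApplication, though whether to mask a missing map this way or keep the stricter behavior is exactly the trade-off discussed in the earlier comments.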
[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Klochkov updated YARN-415: - Assignee: Eric Payne (was: Andrey Klochkov) Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Eric Payne Fix For: 2.6.0 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.201409102216.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
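The MB-seconds formula in the description reduces to a simple sum over containers; a toy illustration (not RM code, with invented container values):
{code}
public class MemorySecondsExample {
  // sum over containers of (reserved MB * lifetime in seconds)
  static long memorySeconds(long[] reservedMb, long[] lifetimeSec) {
    long total = 0;
    for (int i = 0; i < reservedMb.length; i++) {
      total += reservedMb[i] * lifetimeSec[i];
    }
    return total;
  }

  public static void main(String[] args) {
    // two 2048 MB containers that each ran for 600 s -> 2,457,600 MB-seconds
    System.out.println(memorySeconds(new long[]{2048, 2048}, new long[]{600, 600}));
  }
}
{code}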
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130479#comment-14130479 ] Hadoop QA commented on YARN-2033: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668114/YARN-2033.13.patch against trunk revision bf64fce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4893//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4893//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, YARN-2033.12.patch, YARN-2033.13.patch, YARN-2033.13.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2534) FairScheduler: Potential integer overflow calculating totalMaxShare
[ https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2534: --- Summary: FairScheduler: Potential integer overflow calculating totalMaxShare (was: FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal) FairScheduler: Potential integer overflow calculating totalMaxShare --- Key: YARN-2534 URL: https://issues.apache.org/jira/browse/YARN-2534 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2534.000.patch FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal for some cases. If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, but each individual max share is not equal to Integer.MAX_VALUE, then totalMaxShare will be a negative value, which causes all fair shares to be calculated incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2534) FairScheduler: Potential integer overflow calculating totalMaxShare
[ https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130519#comment-14130519 ] Karthik Kambatla commented on YARN-2534: Pretty straightforward fix. +1. FairScheduler: Potential integer overflow calculating totalMaxShare --- Key: YARN-2534 URL: https://issues.apache.org/jira/browse/YARN-2534 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2534.000.patch FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal for some cases. If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, but each individual max share is not equal to Integer.MAX_VALUE, then totalMaxShare will be a negative value, which causes all fair shares to be calculated incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
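For readers following along, a minimal sketch of the overflow and one way to avoid it, assuming each Schedulable exposes an int max share (illustrative, not the attached patch):
{code}
public class TotalMaxShare {
  // Summing int max shares into an int can wrap negative once the total
  // exceeds Integer.MAX_VALUE; accumulating in a long and capping avoids that.
  static int totalMaxShare(int[] maxShares) {
    long total = 0L;
    for (int share : maxShares) {
      total += share;
      if (total >= Integer.MAX_VALUE) {
        return Integer.MAX_VALUE;   // saturate instead of overflowing
      }
    }
    return (int) total;
  }
}
{code}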
[jira] [Resolved] (YARN-1104) NMs to support rolling logs of stdout stderr
[ https://issues.apache.org/jira/browse/YARN-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1104. --- Resolution: Duplicate This is an older ticket, but I am closing this as dup of YARN-2468 where patches are already posted, and the work is in progress.. NMs to support rolling logs of stdout stderr -- Key: YARN-1104 URL: https://issues.apache.org/jira/browse/YARN-1104 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.1.0-beta Reporter: Steve Loughran Assignee: Xuan Gong Currently NMs stream the stdout and stderr streams of a container to a file. For longer lived processes those files need to be rotated so that the log doesn't overflow -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130554#comment-14130554 ] Anubhav Dhoot commented on YARN-1372: - bq. Why add context.getContainers().remove(cid); in the removeVeryOldStoppedContainersFromContext method? Won't this remove the containers from context immediately when we send the container statuses across, which contradicts the rest of the changes? Two reasons. 1) To enforce a timeout for the context entries as a safety net (this should happen only after the timeout, which defaults to 10 min). Otherwise I am worried there will be cases where the entries do not get removed for a long time. Is it possible that for some reason there is no ack from AM and the application never gets removed and these entries stay in memory? 2) If we are removing it from the nm store, is there any value in keeping it in memory? If NM restarts, it's not going to know about this anyway. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2484) FileSystemRMStateStore#readFile/writeFile should close FSData(In|Out)putStream in final block
[ https://issues.apache.org/jira/browse/YARN-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130560#comment-14130560 ] Jason Lowe commented on YARN-2484: -- Thanks for the report and patch, Tsuyoshi! I'm a bit concerned that calling close() after encountering an error could itself throw another exception and obscure the original error. So we may want to do something more like this instead: {code} try { ... f.close(); f = null; } finally { IOUtils.cleanup(LOG, f); } {code} Also in the read case I'm not sure we care about the result of close(), since we have all of the data we need. I don't want to take down the RM because there was an error closing a file that we completely read. FileSystemRMStateStore#readFile/writeFile should close FSData(In|Out)putStream in final block - Key: YARN-2484 URL: https://issues.apache.org/jira/browse/YARN-2484 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2484.1.patch File descriptors can leak if exceptions are thrown in these methods. {code} private byte[] readFile(Path inputPath, long len) throws Exception { FSDataInputStream fsIn = fs.open(inputPath); // state data will not be that long byte[] data = new byte[(int)len]; fsIn.readFully(data); fsIn.close(); return data; } {code} {code} private void writeFile(Path outputPath, byte[] data) throws Exception { Path tempPath = new Path(outputPath.getParent(), outputPath.getName() + .tmp); FSDataOutputStream fsOut = null; // This file will be overwritten when app/attempt finishes for saving the // final status. fsOut = fs.create(tempPath, true); fsOut.write(data); fsOut.close(); fs.rename(tempPath, outputPath); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
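Putting Jason's suggestion together, a hedged sketch of what readFile might look like, reusing the fs and LOG fields from the quoted methods (not the committed patch):
{code}
// requires org.apache.hadoop.io.IOUtils and org.apache.hadoop.fs.FSDataInputStream
private byte[] readFile(Path inputPath, long len) throws Exception {
  FSDataInputStream fsIn = null;
  try {
    fsIn = fs.open(inputPath);
    // state data will not be that long
    byte[] data = new byte[(int) len];
    fsIn.readFully(data);
    return data;
  } finally {
    // a failed close() can no longer mask the original error, and a close
    // failure after a successful read is only logged, not rethrown
    IOUtils.cleanup(LOG, fsIn);
  }
}
{code}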
[jira] [Updated] (YARN-2538) Add logs when RM send new AMRMToken to ApplicationMaster
[ https://issues.apache.org/jira/browse/YARN-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2538: Attachment: YARN-2538.1.patch Trivial patch Add logs when RM send new AMRMToken to ApplicationMaster Key: YARN-2538 URL: https://issues.apache.org/jira/browse/YARN-2538 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2538.1.patch This is for testing/debugging purpose -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.2.patch .2.patch is a port of YARN-2198.trunk.4.patch. It applies cleanly to branch-2 and builds fine. Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2 -- Key: YARN-2357 URL: https://issues.apache.org/jira/browse/YARN-2357 Project: Hadoop YARN Issue Type: Task Components: nodemanager Affects Versions: 2.4.0 Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical Labels: security, windows Attachments: YARN-2357.1.patch, YARN-2357.2.patch As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to trunk, they need to be backported to branch-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130691#comment-14130691 ] Jian He commented on YARN-1372: --- Thanks for your explanation. bq. Is it possible that for some reason there is no ack from AM and the application never gets removed and these entries stay in memory? ApplicationImpl on the NM should be guaranteed to be cleaned up for already completed applications. (Otherwise, it's a leak; we should fix this too.) bq. If we are removing it from the nm store, is there any value in keeping it in memory? If NM restarts, it's not going to know about this anyway. That's why I said in my previous comment: {{make sure context.getNMStateStore().removeContainer(cid); is called after receiving the notification from RM as well.}} One other thing: - In RMAppAttemptImpl#pullJustFinishedContainers, we may just send the whole list of containers in one event, instead of sending an individual event for each container. {code} for (Map.Entry<ContainerStatus, NodeId> finishedContainerStatus : this.finishedContainersSentToAM.entrySet()) { // Implicitly acks the previous list as being received by the AM eventHandler.handle(new RMNodeCleanedupContainerNotifiedEvent( finishedContainerStatus.getValue(), finishedContainerStatus.getKey().getContainerId())); } {code} Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
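A hedged sketch of the "one event per node instead of per container" suggestion above; it reuses the fields from the quoted loop, and the batched event class name is hypothetical, not an existing YARN API:
{code}
// group the acknowledged containers by node, then emit one event per node
Map<NodeId, List<ContainerId>> ackedPerNode = new HashMap<NodeId, List<ContainerId>>();
for (Map.Entry<ContainerStatus, NodeId> e : finishedContainersSentToAM.entrySet()) {
  List<ContainerId> ids = ackedPerNode.get(e.getValue());
  if (ids == null) {
    ids = new ArrayList<ContainerId>();
    ackedPerNode.put(e.getValue(), ids);
  }
  ids.add(e.getKey().getContainerId());
}
for (Map.Entry<NodeId, List<ContainerId>> e : ackedPerNode.entrySet()) {
  // one batched notification per node, implicitly acking every container the AM pulled
  eventHandler.handle(new RMNodeContainersPulledByAMEvent(e.getKey(), e.getValue()));
}
{code}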
[jira] [Updated] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2032: Attachment: YARN-2032-091114.patch Hi everyone, based on [~mayank_bansal]'s patch, I've done an updated version for the HBase timeline storage. The general strategy and data schema are kept the same as in the original patch. Here's what I've done: 1. Finished the fromTs and fromId functionality in getEntities. For fromId, I added one more check to change the first record of a scan if necessary. For fromTs, I added a separate column qualifier in the entity table to store the insert time for each entity, and, during query, filter out later records if necessary. 2. I restructured the code such that the data schema and operations are decoupled from the actual HBase operations. Most timeline data storage logic is in the abstract store class now, while the HBase storage class only needs to implement the abstract methods and interfaces required by the abstract storage class, including table creation, get, put, scan, and a few other helper functions. I hope this pluggable interface will simplify future extensions and help provide a unified abstract storage schema across different data storages. Comments on this design would certainly be more than welcome. 3. I've added two more unit test cases, to test fromId and fromTs, for HBase storage. Currently, the UTs work fine with branch-2, or with an HttpServer.java accessible by HBaseClient, but the UTs are failing on trunk since HttpServer.java has been replaced. I added Ignore tags to the UTs for now, but feel free to check them under branch-2. Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-091114.patch, YARN-2032-branch-2-1.patch, YARN-2032-branch2-2.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
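A rough sketch of the split Li Lu describes, with hypothetical class and method names (the real abstraction lives in the attached patch):
{code}
import java.io.IOException;
import java.util.Iterator;

// Backend-independent timeline logic would live here; a concrete backend such
// as HBase only implements the low-level primitives below.
abstract class AbstractTimelineStoreSketch {
  protected abstract void createTables() throws IOException;
  protected abstract void put(byte[] row, byte[] qualifier, byte[] value) throws IOException;
  protected abstract byte[] get(byte[] row, byte[] qualifier) throws IOException;
  protected abstract Iterator<byte[]> scan(byte[] startRow, byte[] stopRow) throws IOException;
  // Shared logic such as getEntities() with fromId/fromTs handling would be
  // implemented once here, on top of the primitives above.
}
{code}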
[jira] [Commented] (YARN-2538) Add logs when RM send new AMRMToken to ApplicationMaster
[ https://issues.apache.org/jira/browse/YARN-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130800#comment-14130800 ] Hadoop QA commented on YARN-2538: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668150/YARN-2538.1.patch against trunk revision c656d7d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4894//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4894//console This message is automatically generated. Add logs when RM send new AMRMToken to ApplicationMaster Key: YARN-2538 URL: https://issues.apache.org/jira/browse/YARN-2538 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2538.1.patch This is for testing/debugging purpose -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130838#comment-14130838 ] Jian He commented on YARN-2229: --- looks good, +1 ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
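To make the id layout in the description concrete, a small illustration of the 10-bit epoch / 22-bit sequence packing and why 1024 restarts overflow it; the field widths come from the description, the helper itself is hypothetical:
{code}
public class ContainerIdLayout {
  // lower 22 bits: per-epoch sequence number, upper bits: RM restart epoch
  static long packContainerId(long epoch, long sequence) {
    return (epoch << 22) | (sequence & 0x3FFFFF);
  }

  public static void main(String[] args) {
    // In a 32-bit id only 10 bits remain for the epoch, so it wraps after
    // 2^10 = 1024 RM restarts; a 64-bit id removes that limit.
    System.out.println(packContainerId(1023, 1));
  }
}
{code}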
[jira] [Created] (YARN-2539) FairScheduler: Update the default value for maxAMShare
Wei Yan created YARN-2539: - Summary: FairScheduler: Update the default value for maxAMShare Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2001: -- Attachment: YARN-2001.3.patch patch rebased Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2539) FairScheduler: Update the default value for maxAMShare
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130881#comment-14130881 ] Ashwin Shankar commented on YARN-2539: -- Hey [~ywskycn], what about the problem described in YARN-2187 due to maxAMShare being enabled by default? FairScheduler: Update the default value for maxAMShare -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2538) Add logs when RM send new AMRMToken to ApplicationMaster
[ https://issues.apache.org/jira/browse/YARN-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2538: Attachment: YARN-2538.1.patch Add logs when RM send new AMRMToken to ApplicationMaster Key: YARN-2538 URL: https://issues.apache.org/jira/browse/YARN-2538 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2538.1.patch, YARN-2538.1.patch This is for testing/debugging purpose -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2540) Fair Scheduler : queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130889#comment-14130889 ] Ashwin Shankar commented on YARN-2540: -- Note : this problem happens when we want to filter by any queue, not just root.default. Fair Scheduler : queue filters not working on scheduler page in RM UI - Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2539) FairScheduler: Update the default value for maxAMShare
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130891#comment-14130891 ] Wei Yan commented on YARN-2539: --- YARN-2187 was a temporary solution to disable the max AM share, because previously the fair share was the steady fair share. The steady fair share can be very small if we have many queues. Now that we compute the max AM share with the dynamic fair share, we can reset the default value to a reasonable one. Also, there may be a bug in the maxAMShare calculation: when a queue receives its first application, its fair share is still 0, which means the queue cannot accept the AM container. I'm double-checking this problem. FairScheduler: Update the default value for maxAMShare -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1708: --- Attachment: YARN-1708.patch Rebased patch after sync-ing branch yarn-1051 with trunk Add a public API to reserve resources (part of YARN-1051) - Key: YARN-1708 URL: https://issues.apache.org/jira/browse/YARN-1708 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Krishnan Attachments: YARN-1708.patch, YARN-1708.patch, YARN-1708.patch, YARN-1708.patch This JIRA tracks the definition of a new public API for YARN, which allows users to reserve resources (think of time-bounded queues). This is part of the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1707: --- Attachment: YARN-1707.10.patch Rebased patch after sync-ing branch yarn-1051 with trunk Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.10.patch, YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.5.patch, YARN-1707.6.patch, YARN-1707.7.patch, YARN-1707.8.patch, YARN-1707.9.patch, YARN-1707.patch The CapacityScheduler is a rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this require the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity())= 100% instead of ==100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1709: --- Attachment: YARN-1709.patch Rebased patch after sync-ing branch yarn-1051 with trunk Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Krishnan Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1711: --- Attachment: YARN-1711.3.patch Rebased patch after sync-ing branch yarn-1051 with trunk CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.3.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2080: --- Attachment: YARN-2080.patch Rebased patch after sync-ing branch yarn-1051 with trunk Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.3.patch Uploaded a new consolidated patch against latest trunk for you to play. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130918#comment-14130918 ] Hadoop QA commented on YARN-1709: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668205/YARN-1709.patch against trunk revision 6c08339. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4898//console This message is automatically generated. Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Krishnan Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130920#comment-14130920 ] Hadoop QA commented on YARN-1707: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668203/YARN-1707.10.patch against trunk revision 6c08339. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4899//console This message is automatically generated. Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.10.patch, YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.5.patch, YARN-1707.6.patch, YARN-1707.7.patch, YARN-1707.8.patch, YARN-1707.9.patch, YARN-1707.patch The CapacityScheduler is a rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this require the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity())= 100% instead of ==100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130923#comment-14130923 ] Hadoop QA commented on YARN-1708: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668202/YARN-1708.patch against trunk revision 6c08339. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4900//console This message is automatically generated. Add a public API to reserve resources (part of YARN-1051) - Key: YARN-1708 URL: https://issues.apache.org/jira/browse/YARN-1708 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Krishnan Attachments: YARN-1708.patch, YARN-1708.patch, YARN-1708.patch, YARN-1708.patch This JIRA tracks the definition of a new public API for YARN, which allows users to reserve resources (think of time-bounded queues). This is part of the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130936#comment-14130936 ] Hadoop QA commented on YARN-2001: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668193/YARN-2001.3.patch against trunk revision 6c08339. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4896//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4896//console This message is automatically generated. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130935#comment-14130935 ] Hadoop QA commented on YARN-2032: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668169/YARN-2032-091114.patch against trunk revision c656d7d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 24 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1265 javac compiler warnings (more than the trunk's current 1264 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.mover.TestStorageMover org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes org.apache.hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4895//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4895//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4895//console This message is automatically generated. Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-091114.patch, YARN-2032-branch-2-1.patch, YARN-2032-branch2-2.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2539) FairScheduler: Update the default value for maxAMShare
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2539: -- Attachment: YARN-2539-1.patch FairScheduler: Update the default value for maxAMShare -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2539-1.patch Currently, the maxAMShare per queue is -1 by default, which disables the AM share constraint. Changing it to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2539) FairScheduler: Update the default value for maxAMShare
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130942#comment-14130942 ] Wei Yan commented on YARN-2539: --- bq. And, there may be a bug in the maxAMShare calculation. When one queue receives the first application, its fair share is still 0, which means the queue cannot accept the AM container. I'm double checking this problem. This cannot happen. The only concern is that the first application submitted to an empty queue may need to wait for 0.5 seconds. The updateThread will update the fair share from 0 to a new value; only after that does the application have a chance to run. FairScheduler: Update the default value for maxAMShare -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2539-1.patch Currently, the maxAMShare per queue is -1 by default, which disables the AM share constraint. Changing it to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
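For readers following the maxAMShare discussion above, here is a simplified Java sketch of the kind of check being described: -1 disables the constraint, otherwise a new AM container is admitted only while AM usage stays within maxAMShare of the queue's fair share. This is a sketch under those assumptions, not the actual FairScheduler implementation from the patch.
{code}
// Simplified sketch of the AM-share admission check; not the real
// FairScheduler code. A fair share of 0 (queue not yet updated by the
// update thread) makes the check fail until the next update cycle,
// which is the ~0.5 second wait mentioned above.
public final class AmShareCheck {
  private AmShareCheck() {}

  public static boolean canRunAm(double maxAMShare,
                                 long queueFairShareMb,
                                 long amUsageMb,
                                 long newAmMb) {
    if (maxAMShare < 0) {
      return true; // -1 (the old default) disables the constraint
    }
    return amUsageMb + newAmMb <= maxAMShare * queueFairShareMb;
  }
}
{code}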
[jira] [Commented] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130945#comment-14130945 ] Hadoop QA commented on YARN-1711: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668207/YARN-1711.3.patch against trunk revision 6c08339. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4901//console This message is automatically generated. CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.3.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130946#comment-14130946 ] Hadoop QA commented on YARN-2080: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668208/YARN-2080.patch against trunk revision 6c08339. {color:red}-1 @author{color}. The patch appears to contain @author tags which the Hadoop community has agreed to not allow in code contributions. {color:green}+1 tests included{color}. The patch appears to include new or modified test files. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4902//console This message is automatically generated. Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2541) Fix ResourceManagerRest.apt.vm syntax error
Jian He created YARN-2541: - Summary: Fix ResourceManagerRest.apt.vm syntax error Key: YARN-2541 URL: https://issues.apache.org/jira/browse/YARN-2541 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2541) Fix ResourceManagerRest.apt.vm syntax error
[ https://issues.apache.org/jira/browse/YARN-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2541: -- Attachment: YARN-2541.1.patch Patch to fix the table syntax Fix ResourceManagerRest.apt.vm syntax error --- Key: YARN-2541 URL: https://issues.apache.org/jira/browse/YARN-2541 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2541.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2541) Fix ResourceManagerRest.apt.vm syntax error
[ https://issues.apache.org/jira/browse/YARN-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130955#comment-14130955 ] Karthik Kambatla commented on YARN-2541: +1, pending Jenkins. Fix ResourceManagerRest.apt.vm syntax error --- Key: YARN-2541 URL: https://issues.apache.org/jira/browse/YARN-2541 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2541.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2538) Add logs when RM send new AMRMToken to ApplicationMaster
[ https://issues.apache.org/jira/browse/YARN-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130958#comment-14130958 ] Hadoop QA commented on YARN-2538: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12668200/YARN-2538.1.patch against trunk revision 6c08339. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4897//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4897//console This message is automatically generated. Add logs when RM send new AMRMToken to ApplicationMaster Key: YARN-2538 URL: https://issues.apache.org/jira/browse/YARN-2538 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2538.1.patch, YARN-2538.1.patch This is for testing/debugging purpose -- This message was sent by Atlassian JIRA (v6.3.4#6332)
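The YARN-2538 patch itself is not reproduced in this digest, so as a hedged illustration only, the change presumably adds something like the following kind of log statement where the RM hands a rolled-over AMRMToken back to an AM; the class name and message text here are assumptions, not the patch's actual code.
{code}
// Illustrative only -- the real patch is not shown in this digest.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AmRmTokenLogExample {
  private static final Log LOG = LogFactory.getLog(AmRmTokenLogExample.class);

  // Something of this shape would run when a new AMRMToken is sent
  // to the application master, to aid testing/debugging.
  public void onNewToken(String appAttemptId) {
    LOG.info("Sending new AMRMToken to " + appAttemptId);
  }
}
{code}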
[jira] [Updated] (YARN-2541) Fix ResourceManagerRest.apt.vm syntax error
[ https://issues.apache.org/jira/browse/YARN-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2541: -- Description: The incorrect table syntax somehow causes intermittent hadoop-yarn-site build failures, as in https://jira.codehaus.org/browse/DOXIA-453 Fix ResourceManagerRest.apt.vm syntax error --- Key: YARN-2541 URL: https://issues.apache.org/jira/browse/YARN-2541 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2541.1.patch The incorrect table syntax somehow causes intermittent hadoop-yarn-site build failures, as in https://jira.codehaus.org/browse/DOXIA-453 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1712: --- Attachment: YARN-1712.4.patch Thanks [~leftnoteasy] for reviewing the patch. I have wrapped the debug logs with _isDebugEnabled()_. This patch has also been rebased after syncing branch yarn-1051 with trunk. Admission Control: plan follower Key: YARN-1712 URL: https://issues.apache.org/jira/browse/YARN-1712 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations, scheduler Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.3.patch, YARN-1712.4.patch, YARN-1712.patch This JIRA tracks a thread that continuously propagates the current state of an inventory subsystem to the scheduler. As the inventory subsystem stores the plan of how the resources should be subdivided, the work we propose in this JIRA realizes that plan by dynamically instructing the CapacityScheduler to add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
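For context, the _isDebugEnabled()_ wrapping mentioned in the comment above refers to the standard commons-logging guard pattern used throughout Hadoop at the time: skip building the debug string unless debug logging is actually on. A minimal sketch (example class and message are illustrative, not taken from the patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class DebugGuardExample {
  private static final Log LOG = LogFactory.getLog(DebugGuardExample.class);

  public void report(String queueName, long capacity) {
    // Guard avoids the cost of string concatenation when DEBUG is off.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Resizing queue " + queueName + " to capacity " + capacity);
    }
  }
}
{code}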
[jira] [Updated] (YARN-2475) ReservationSystem: replan upon capacity reduction
[ https://issues.apache.org/jira/browse/YARN-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2475: --- Attachment: YARN-2475.patch Thanks [~chris.douglas] for reviewing the patch. I am uploading a patch that addresses all your comments (skipping relisting them). bq. Why is the enforcement window tied to CapacitySchedulerConfiguration? The replanner can be configured per plan which in turn translates to a leaf queue in capacity scheduler configuration. Consequently the enforcement window is configured for the replanner via the capacity scheduler leaf queue configuration. ReservationSystem: replan upon capacity reduction - Key: YARN-2475 URL: https://issues.apache.org/jira/browse/YARN-2475 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-2475.patch, YARN-2475.patch In the context of YARN-1051, if capacity of the cluster drops significantly upon machine failures we need to trigger a reorganization of the planned reservations. As reservations are absolute it is possible that they will not all fit, and some need to be rejected a-posteriori. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
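The comment above describes a per-plan replanner whose enforcement window is read from the capacity scheduler leaf queue configuration. As a hedged sketch of that lookup pattern only, the snippet below reads a per-queue property from a Hadoop Configuration; the property suffix is a made-up placeholder, not the key the patch actually introduces.
{code}
// Sketch of a per-leaf-queue configuration lookup; the property name
// ".reservation-enforcement-window" is a placeholder assumption.
import org.apache.hadoop.conf.Configuration;

public class EnforcementWindowLookup {
  private static final String PREFIX = "yarn.scheduler.capacity.";

  public static long getEnforcementWindowMs(Configuration conf,
                                             String queuePath,
                                             long defaultMs) {
    // e.g. yarn.scheduler.capacity.root.<queue>.reservation-enforcement-window
    return conf.getLong(
        PREFIX + queuePath + ".reservation-enforcement-window", defaultMs);
  }
}
{code}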
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130972#comment-14130972 ] Li Lu commented on YARN-2032: - Oh sorry, it should be HDFS-6584. Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-091114.patch, YARN-2032-branch-2-1.patch, YARN-2032-branch2-2.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130971#comment-14130971 ] Li Lu commented on YARN-2032: - The HDFS unit test failures are very confusing here. TestStorageMover is not even in trunk. As pointed out by [~jingzhao], the result of this run may come from HDFS-6564. Apparently something went wrong during this test run. Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-091114.patch, YARN-2032-branch-2-1.patch, YARN-2032-branch2-2.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.3.4#6332)