[jira] [Assigned] (YARN-1494) YarnClient doesn't wrap renewDelegationToken/cancelDelegationToken of ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-1494: -- Assignee: Varun Saxena YarnClient doesn't wrap renewDelegationToken/cancelDelegationToken of ApplicationClientProtocol --- Key: YARN-1494 URL: https://issues.apache.org/jira/browse/YARN-1494 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena YarnClient doesn't wrap renewDelegationToken/cancelDelegationToken of ApplicationClientProtocol; it only wraps getDelegationToken. After YARN-1363, renewDelegationToken/cancelDelegationToken are going to be async, so the procedure for canceling/renewing a DT is no longer straightforward. It's better to wrap these two APIs as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
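For context, a minimal sketch of what such wrapper methods could look like inside YarnClientImpl, mirroring the existing getRMDelegationToken wrapper. The method names renewRMDelegationToken/cancelRMDelegationToken are illustrative and not from any attached patch; rmClient is assumed to be the ApplicationClientProtocol proxy that YarnClientImpl holds:
{code}
// Hypothetical wrapper methods (names are illustrative, not from a patch);
// rmClient is the ApplicationClientProtocol proxy held by YarnClientImpl.
public long renewRMDelegationToken(org.apache.hadoop.yarn.api.records.Token dtoken)
    throws YarnException, IOException {
  RenewDelegationTokenRequest request =
      Records.newRecord(RenewDelegationTokenRequest.class);
  request.setDelegationToken(dtoken);
  // Return the new expiration time reported by the RM.
  return rmClient.renewDelegationToken(request).getNextExpirationTime();
}

public void cancelRMDelegationToken(org.apache.hadoop.yarn.api.records.Token dtoken)
    throws YarnException, IOException {
  CancelDelegationTokenRequest request =
      Records.newRecord(CancelDelegationTokenRequest.class);
  request.setDelegationToken(dtoken);
  rmClient.cancelDelegationToken(request);
}
{code}
Once YARN-1363 makes these RM-side operations asynchronous, the wrappers would presumably also need to poll for completion; that part is omitted in this sketch.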
[jira] [Updated] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-2190: Attachment: YARN-2190.7.patch Attaching a new patch that incorporates the latest changes in winutils. Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch, YARN-2190.6.patch, YARN-2190.7.patch The default YARN container executor on Windows does not currently set resource limits on containers; the memory limit is enforced by a separate monitoring thread. The container implementation on Windows currently uses Job Objects. The Windows 8 (and later) API allows CPU and memory limits to be set on job objects. We want to create a Windows container executor that sets these limits on the job objects, thus providing resource enforcement at the OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222776#comment-14222776 ] Hadoop QA commented on YARN-2190: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683290/YARN-2190.7.patch against trunk revision 555fa2d. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5915//console This message is automatically generated. Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch, YARN-2190.6.patch, YARN-2190.7.patch The default YARN container executor on Windows does not currently set resource limits on containers; the memory limit is enforced by a separate monitoring thread. The container implementation on Windows currently uses Job Objects. The Windows 8 (and later) API allows CPU and memory limits to be set on job objects. We want to create a Windows container executor that sets these limits on the job objects, thus providing resource enforcement at the OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222823#comment-14222823 ] Devaraj K commented on YARN-2877: - +1 for the idea, [~sriramsrao], [~curino]. I just want to check the following, in case I am not missing something from the above. 1. If an OPTIMISTIC container is assigned to the AM, and at the same time the RM assigns a CONSERVATIVE container for the same resource, which one will the NM consider and start? 2. If an OPTIMISTIC container is assigned to the AM and started, and the NM then receives a start request for a CONSERVATIVE container while resources are not available, will the NM preempt the running OPTIMISTIC containers, or will it make the CONSERVATIVE request wait for the OPTIMISTIC containers to complete? 3. Is there any provision for the AM to request OPTIMISTIC containers on remote NMs as well? Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines. 2. Reduce allocation latency for tasks where the scheduling time dominates (i.e., the task execution time is small compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2892) Unable to get AMRMToken in unmanaged AM when using a secure cluster
[ https://issues.apache.org/jira/browse/YARN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned YARN-2892: --- Assignee: Sevada Abraamyan (was: Rohith) I added [~sevada] as a contributor and assigned this. Unable to get AMRMToken in unmanaged AM when using a secure cluster --- Key: YARN-2892 URL: https://issues.apache.org/jira/browse/YARN-2892 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Sevada Abraamyan Assignee: Sevada Abraamyan An AMRMToken is retrieved from the ApplicationReport by the YarnClient. When the RM creates the ApplicationReport and sends it back to the client, it makes a simple security check on whether it should include the AMRMToken in the report (see createAndGetApplicationReport in RMAppImpl). This security check verifies that the user who submitted the original application is the same user who is requesting the ApplicationReport. If they are indeed the same user, it includes the AMRMToken; otherwise it does not. The problem arises from the fact that when an application is submitted, the RM saves the short username of the user who created the application (see submitApplication in ClientRmService). Afterwards, when the ApplicationReport is requested, the system tries to match the full username of the requester against the previously stored short username. In a secure cluster using Kerberos this check fails because the realm is stripped from the principal when we request a short username. So, for example, the short username might be Foo whereas the full username is f...@company.com Note: A very similar problem has been previously reported ([Yarn-2232|https://issues.apache.org/jira/browse/YARN-2232]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
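A hedged illustration of the mismatch using Hadoop's real UserGroupInformation API; the principal "foo@EXAMPLE.COM" below is made up for the example:
{code}
import org.apache.hadoop.security.UserGroupInformation;

public class ShortNameDemo {
  public static void main(String[] args) {
    // "foo@EXAMPLE.COM" is a made-up Kerberos principal for illustration.
    UserGroupInformation ugi =
        UserGroupInformation.createRemoteUser("foo@EXAMPLE.COM");
    String full = ugi.getUserName();           // "foo@EXAMPLE.COM" -- what the report check sees
    String shortName = ugi.getShortUserName(); // "foo" -- what was stored at submission
    // On a Kerberized cluster the two differ, so a naive equality check fails.
    System.out.println(full.equals(shortName)); // false
  }
}
{code}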
[jira] [Updated] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor
[ https://issues.apache.org/jira/browse/YARN-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-2243: Affects Version/s: 2.5.1 Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor Key: YARN-2243 URL: https://issues.apache.org/jira/browse/YARN-2243 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Ted Yu Assignee: Devaraj K Priority: Minor Attachments: YARN-2243.patch, YARN-2243.patch
{code}
public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId,
    String user, Queue queue, ActiveUsersManager activeUsersManager,
    RMContext rmContext) {
  Preconditions.checkNotNull("RMContext should not be null", rmContext);
{code}
The order of arguments is wrong for Preconditions.checkNotNull(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
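Guava's contract is checkNotNull(T reference, Object errorMessage): the reference under test comes first, the message second. A sketch of the fix (not the attached patch) is simply to swap the arguments:
{code}
// Correct argument order for Guava's Preconditions.checkNotNull:
// the reference being checked first, the error message second.
Preconditions.checkNotNull(rmContext, "RMContext should not be null");
{code}
With the arguments reversed as in the reported snippet, the check tests the message string (which is never null), so a null rmContext slips through.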
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222872#comment-14222872 ] Konstantinos Karanasos commented on YARN-2877: -- [~wangda], regarding your question about how the AM will know which NM is more idle than the others: this is related to YARN-2886. Each NM estimates its queue waiting time (based on the tasks already running and those waiting in the queue) and sends this waiting time to the RM through the heartbeat. Note that this is just an integer, so it is very lightweight. The RM can then push this information to the rest of the NMs (again through the heartbeats). This way each node knows the queue status of the other NMs and can decide where to queue its queueable requests. However, since this information may not always be precise (due to bad estimation or stale info), we also introduce correction mechanisms for rebalancing the queues, if need be (YARN-2888). Regarding your other questions: # Malicious AMs are one of the basic reasons we have introduced the Local RM. The AMs can make queueable requests only to the Local RM, which can throttle down aggressive AMs without even needing to reach the central RM. Clearly, as you mention, the central RM can also be involved in imposing elaborate fairness/capacity constraints, if those are needed. # Promoting a queueable container to a guaranteed-start one is indeed interesting, and we have been investigating the cases in which it would bring benefits. One is the case you mention. Another is the case where a queueable container has been preempted/killed many times due to other guaranteed-start requests. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines. 2. Reduce allocation latency for tasks where the scheduling time dominates (i.e., the task execution time is small compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222876#comment-14222876 ] Konstantinos Karanasos commented on YARN-2877: -- [~devaraj], to answer your questions: # Guaranteed-start containers always have priority over queueable ones. Thus, in the case you describe, if the NM cannot accommodate both requests, the guaranteed-start one will start first. # If the queueable one was started before the guaranteed-start arrived, it will be preempted/killed so that the guaranteed-start one can begin execution. # Queueable requests are submitted by the AM to the Local RM running on the same node as the AM, but those requests can be queued at any NM of the cluster (at each moment we pick the most idle NMs to queue those requests). Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines. 2. Reduce allocation latency for tasks where the scheduling time dominates (i.e., the task execution time is small compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222880#comment-14222880 ] Konstantinos Karanasos commented on YARN-2877: -- I used the wrong name in the above comment -- it was referring to [~devaraj.k]'s comment. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines. 2. Reduce allocation latency for tasks where the scheduling time dominates (i.e., the task execution time is small compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2894) Disallow binding of aclManagers while starting RMWebApp
Rohith created YARN-2894: Summary: Disallow binding of aclManagers while starting RMWebApp Key: YARN-2894 URL: https://issues.apache.org/jira/browse/YARN-2894 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Fix For: 2.7.0 Binding the aclManager to RMWebApp can cause problems if the RM is switched: some validation checks may fail. I think we should not bind the aclManager for RMWebApp; instead we should get it from the RM instance. In RMWebApp,
{code}
if (rm != null) {
  bind(ResourceManager.class).toInstance(rm);
  bind(RMContext.class).toInstance(rm.getRMContext());
  bind(ApplicationACLsManager.class).toInstance(
      rm.getApplicationACLsManager());
  bind(QueueACLsManager.class).toInstance(rm.getQueueACLsManager());
}
{code}
and in AppBlock#render the check below may fail (need to test and confirm):
{code}
if (callerUGI != null
    && !(this.aclsManager.checkAccess(callerUGI,
            ApplicationAccessType.VIEW_APP, app.getUser(), appID)
        || this.queueACLsManager.checkAccess(callerUGI,
            QueueACL.ADMINISTER_QUEUE, app.getQueue()))) {
  puts("You (User " + remoteUser + ") are not authorized to view application " + appID);
  return;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
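One possible shape of the fix, sketched here under the assumption that the Guice module keeps a reference to the ResourceManager and can resolve the managers lazily; this is not the committed change:
{code}
// Hedged sketch: bind providers that ask the live RM each time,
// instead of binding the instances captured at web-app start.
bind(ApplicationACLsManager.class).toProvider(
    new com.google.inject.Provider<ApplicationACLsManager>() {
      @Override
      public ApplicationACLsManager get() {
        return rm.getApplicationACLsManager(); // always the current RM's manager
      }
    });
bind(QueueACLsManager.class).toProvider(
    new com.google.inject.Provider<QueueACLsManager>() {
      @Override
      public QueueACLsManager get() {
        return rm.getQueueACLsManager();
      }
    });
{code}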
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223079#comment-14223079 ] Jason Lowe commented on YARN-1984: -- Thanks for picking this up, Varun. getStartTimeLong can leak the runtime DBException and shouldn't. Is there a reason to have deleteNextEntity throw DBException rather than IOException? It would be cleaner for callers if deleteNextEntity handled this. loadVersion can also leak the runtime DBException. LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
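The underlying pattern being requested, as a self-contained hedged sketch (safeGet is a hypothetical helper, not from the patch): catch leveldb's unchecked DBException at the method boundary and rethrow it as the advertised IOException.
{code}
import java.io.IOException;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBException;

final class DbAccess {
  private DbAccess() {}

  // Hypothetical helper: surfaces leveldb's runtime DBException as a checked
  // IOException so callers (e.g. the deletion thread) cannot be killed by it.
  static byte[] safeGet(DB db, byte[] key) throws IOException {
    try {
      return db.get(key);
    } catch (DBException e) {
      throw new IOException(e);
    }
  }
}
{code}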
[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1963: -- Attachment: YARN Application Priorities Design_01.pdf Updated design doc as per the comments from [~wangda] Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223091#comment-14223091 ] Mit Desai commented on YARN-2517: - I had similar concerns. Do we really need this at this point? And, as Hitesh pointed out, it may hinder the design in the future. bq. Also, is the timeline layer meant to eventually be reliable and always up? As far as I am aware, this is not going to happen in the near future. Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It would be good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
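For reference, the general shape being discussed, as a hedged sketch rather than the proposed implementation: a bounded buffer, one dispatcher thread, and a callback for responses. TimelineClient.putEntities, TimelineEntity, and TimelinePutResponse are the real APIs; the class and callback names here are made up.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class SimpleTimelineClientAsync {
  public interface Callback {
    void onResponse(TimelinePutResponse response);
    void onError(TimelineEntity entity, Throwable error);
  }

  private final TimelineClient client;
  private final Callback callback;
  private final BlockingQueue<TimelineEntity> buffer =
      new LinkedBlockingQueue<TimelineEntity>(1024); // bounded buffer
  private final Thread dispatcher;

  public SimpleTimelineClientAsync(TimelineClient client, Callback callback) {
    this.client = client;
    this.callback = callback;
    this.dispatcher = new Thread(new Runnable() {
      @Override
      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            TimelineEntity entity = buffer.take(); // blocks until an entity arrives
            try {
              // putEntities is the existing synchronous TimelineClient call.
              SimpleTimelineClientAsync.this.callback.onResponse(
                  SimpleTimelineClientAsync.this.client.putEntities(entity));
            } catch (Exception e) {
              SimpleTimelineClientAsync.this.callback.onError(entity, e);
            }
          }
        } catch (InterruptedException ie) {
          // shutting down
        }
      }
    }, "TimelineDispatcher");
    this.dispatcher.setDaemon(true);
    this.dispatcher.start();
  }

  /** Enqueue without blocking the caller; returns false if the buffer is full. */
  public boolean putEntityAsync(TimelineEntity entity) {
    return buffer.offer(entity);
  }

  public void stop() {
    dispatcher.interrupt();
  }
}
{code}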
[jira] [Commented] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223100#comment-14223100 ] Sunil G commented on YARN-2691: --- Hi [~rohithsharma], this patch might need rebasing. Please rebase against trunk. User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch Support for handling the Application-Priority label coming from the client to ApplicationSubmissionContext, plus common API support for users. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2895) Integrate distributed scheduling with capacity scheduler
Wangda Tan created YARN-2895: Summary: Integrate distributed scheduling with capacity scheduler Key: YARN-2895 URL: https://issues.apache.org/jira/browse/YARN-2895 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Wangda Tan Assignee: Wangda Tan There are some benefits to integrating the distributed scheduling mechanism (LocalRM) with the capacity scheduler: - Resource usage of opportunistic containers can be tracked by the central RM, and capacity can be enforced. - There is an opportunity to transfer an opportunistic container to a conservative one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223127#comment-14223127 ] Wangda Tan commented on YARN-1963: -- [~sunilg], I agree with your latest comment. I will get back to you once I read the new design doc. Thanks, Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
Sunil G created YARN-2896: - Summary: Server side PB changes for Priority Label Manager and Admin CLI support Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sunil G Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()
[ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223138#comment-14223138 ] Rohith commented on YARN-2025: -- The impact of this is that both RMs remain in standby and are not able to recover at all. Possible NPE in schedulers#addApplicationAttempt() -- Key: YARN-2025 URL: https://issues.apache.org/jira/browse/YARN-2025 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2025.1.patch In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we don't check whether {{application}} is null. This can cause an NPE in the following sequence: addApplication() -> doneApplication() (e.g. AppKilledTransition) -> addApplicationAttempt().
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
String user = application.getUser();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
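The straightforward guard implied by the report, sketched against the snippet above (LOG is assumed to be the scheduler's logger; the exact recovery behavior on null is a design decision for the patch, not something decided here):
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  // The app may already have completed (e.g. AppKilledTransition removed it)
  // before the attempt arrived; skip instead of dereferencing null.
  LOG.warn("Application " + applicationAttemptId.getApplicationId()
      + " not found; ignoring attempt " + applicationAttemptId);
  return;
}
String user = application.getUser();
{code}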
[jira] [Updated] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
[ https://issues.apache.org/jira/browse/YARN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2896: -- Description: Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) Server side PB changes for Priority Label Manager and Admin CLI support --- Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Sunil G Assignee: Sunil G Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2896) Server side PB changes for Priority Label Manager and Admin CLI support
[ https://issues.apache.org/jira/browse/YARN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2896: -- Attachment: 0001-YARN-2896.patch Uploading an initial patch for common PB support. This patch is needed for the Priority Label Manager. Tests will also be added soon. Server side PB changes for Priority Label Manager and Admin CLI support --- Key: YARN-2896 URL: https://issues.apache.org/jira/browse/YARN-2896 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2896.patch Common changes: * PB support changes required for Admin APIs * PB support for File System store (Priority Label Store) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()
[ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223136#comment-14223136 ] Rohith commented on YARN-2025: -- I ran into a weird scenario where I got the NPE in {{CapacityScheduler.addApplicationAttempt}} in a different manner. I could get some information from the logs, but not all of it, since the logs had been rolled over. The application's final state is FAILED, but the ApplicationAttempt's final state is null. This looks very strange: how can the RMApp be FAILED but the RMAppAttempt null? The extracted log from the RM is below. Because of this scenario, application recovery throws an NPE, since the RMAppAttempt tries to add the attempt to the scheduler, but the application details were never added to the scheduler.
{noformat}
2014-11-24 23:53:32,608 | INFO | main-EventThread | Recovering app: application_1416805604019_0038 with 1 attempts and final state = FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2014-11-24 23:53:32,609 | INFO | main-EventThread | Recovering attempt: appattempt_1416805604019_0038_01 with final state: null | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:735)
{noformat}
The NPE trace is as follows.
{noformat}
2014-11-24 23:53:32,610 | ERROR | main-EventThread | Failed to load/recover state | org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:527)
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:963)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:931)
	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:698)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:803)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:95)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:825)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:808)
	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:681)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:335)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1148)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:523)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:927)
{noformat}
Possible NPE in schedulers#addApplicationAttempt() -- Key: YARN-2025 URL: https://issues.apache.org/jira/browse/YARN-2025 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2025.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()
[ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223142#comment-14223142 ] Hadoop QA commented on YARN-2025: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643596/YARN-2025.1.patch against trunk revision 555fa2d. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5916//console This message is automatically generated. Possible NPE in schedulers#addApplicationAttempt() -- Key: YARN-2025 URL: https://issues.apache.org/jira/browse/YARN-2025 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2025.1.patch In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we don't check whether {{application}} is null. This can cause an NPE in the following sequence: addApplication() -> doneApplication() (e.g. AppKilledTransition) -> addApplicationAttempt().
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
String user = application.getUser();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2693) Priority Label Manager in RM to manage priority labels
[ https://issues.apache.org/jira/browse/YARN-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2693: -- Attachment: 0002-YARN-2693.patch Updating the patch after moving the PB changes to a common JIRA which handles only PB-related changes. Also moved the ApplicationPriority class to the user API support JIRA. Tests will be added soon. Kindly check. Priority Label Manager in RM to manage priority labels -- Key: YARN-2693 URL: https://issues.apache.org/jira/browse/YARN-2693 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2693.patch, 0002-YARN-2693.patch The focus of this JIRA is to have a centralized service to handle priority labels, supporting operations such as: * Add/delete a priority label to/from a specified queue * Manage the integer mapping associated with each priority label * Support managing the default priority label of a given queue * ACL support at the queue level for priority labels * Expose an interface to the RM to validate priority labels Storage for these labels will be done in the FileSystem and in memory, similar to NodeLabel: * FileSystem based: persistent across RM restart * Memory based: non-persistent across RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223197#comment-14223197 ] Carlo Curino commented on YARN-2877: I am going to echo [~kkaranasos] regarding malicious AMs. The key architectural change we propose is to introduce a proxy layer (YARN-2884). This gives us a place that is both distributed and part of the infrastructure (thus inherently trusted) where we can enact policies. This is where we host the LocalRM functionality of YARN-2885. With this in place we do not have to depend on trusting the AM for distributed decisions (the AM only exposes its need for containers of different types). On the contrary, we can enable a broad spectrum of infrastructure-level policies that can leverage explicit or implicit information to impose caps, or to balance (or skew) where the queueable containers should be allocated, etc. As we have done in the past, we are working towards providing rather *general purpose mechanisms*, and propose a *first set of policies* (AM, LocalRM, NM start/stop of containers). Policies can be evolved/overridden easily depending on use cases, while mechanisms are a little harder to change. To this end, carefully discussing other use cases, such as the conversation around using queueable containers for Impala, is very important, as we might have missed hooks as part of the mechanisms that are necessary to support those scenarios. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines. 2. Reduce allocation latency for tasks where the scheduling time dominates (i.e., the task execution time is small compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2801) Documentation development for Node labels requirement
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223203#comment-14223203 ] Wangda Tan commented on YARN-2801: -- Since the assignee field is empty and I got no response, I am taking this over. Documentation development for Node labels requirement Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2801) Documentation development for Node labels requirement
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2801: Assignee: Wangda Tan Documentation development for Node labels requirement Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2637: -- Attachment: YARN-2637.9.patch This patch seems to pass all the existing unit tests on my box; verifying. Still to do: a unit test for the change itself, and removing some extra logging. maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();

  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }

  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200. If the user then uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223219#comment-14223219 ] Rohith commented on YARN-2762: -- I am a little confused about the Hadoop QA result. I am able to apply the patch successfully. I re-kicked Jenkins to check again whether there is really a compilation problem. RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.patch All NodeLabel argument validations are done on the server side. The same can be done in RMAdminCLI so that unnecessary RPC calls can be avoided. And for input such as x,y,,z,, there is no need to add an empty string; it can instead be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
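The client-side normalization amounts to something like the following hedged sketch (parseLabels is an illustrative helper, not the attached patch):
{code}
import java.util.HashSet;
import java.util.Set;

final class LabelArgs {
  private LabelArgs() {}

  // Trim each fragment and skip empties, so "x,y,,z,," yields {"x", "y", "z"}
  // before any RPC is made to the RM.
  static Set<String> parseLabels(String args) {
    Set<String> labels = new HashSet<String>();
    for (String part : args.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        labels.add(trimmed);
      }
    }
    return labels;
  }
}
{code}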
[jira] [Updated] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-1984: --- Attachment: YARN-1984.001.patch LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223221#comment-14223221 ] Varun Saxena commented on YARN-1984: Thanks for the review, [~jlowe]. I had missed handling DBException in one place, and the other places didn't handle it because the caller method was handling DBException. But in hindsight, I think we should handle the DBException in all the methods you mentioned above, as the method signatures don't advertise throwing DBException. I have uploaded a new patch. Kindly review. LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223225#comment-14223225 ] Zhijie Shen commented on YARN-1984: --- Thanks for your effort, Varun and Jason! bq. Is there a reason to have deleteNextEntity throw DBException rather than IOException? It would be cleaner for callers if deleteNextEntity handled this. Maybe we don't need to do that. It's consistent to catch DBException in the same method where the LeveldbIterator is constructed, but not in the inner method where the LeveldbIterator is passed in. The test case should be fine if we need to catch DBException separately when testing a private method. bq. loadVersion can leak the runtime DBException It seems that loadVersion doesn't use an iterator. Or can LeveldbIterator help with the get method too? LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2897) CrossOriginFilter needs more log statements
Mit Desai created YARN-2897: --- Summary: CrossOriginFilter needs more log statements Key: YARN-2897 URL: https://issues.apache.org/jira/browse/YARN-2897 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai CrossOriginFilter does not log enough to make debugging easy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223225#comment-14223225 ] Zhijie Shen edited comment on YARN-1984 at 11/24/14 6:15 PM: - Thanks for your effort, Varun and Jason! bq. Is there a reason to have deleteNextEntity throw DBException rather than IOException? It would be cleaner for callers if deleteNextEntity handled this. Maybe we don't need to do that. It's consistent to catch DBException in the same method where the LeveldbIterator is constructed, but not in the inner method where the LeveldbIterator is passed in. The test case should be fine if we need to catch DBException separately when testing a private method. bq. loadVersion can leak the runtime DBException It seems that loadVersion doesn't use an iterator. Or can LeveldbIterator help with the get method too? BTW, handleException can be static and more general by taking one more param, an error code, such that it can be reused in more places of this class. was (Author: zjshen): Thanks for your effort, Varun and Jason! bq. Is there a reason to have deleteNextEntity throw DBException rather than IOException? It would be cleaner for callers if deleteNextEntity handled this. Maybe we don't need to do that. It's consistent to catch DBException in the same method where the LeveldbIterator is constructed, but not in the inner method where the LeveldbIterator is passed in. The test case should be fine if we need to catch DBException separately when testing a private method. bq. loadVersion can leak the runtime DBException It seems that loadVersion doesn't use an iterator. Or can LeveldbIterator help with the get method too? LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223261#comment-14223261 ] Varun Saxena commented on YARN-1984: Thanks [~zjshen] for the review. loadVersion needs to handle DBException because DB#get can throw DBException. I guess handling DBException inside deleteNextEntity is a matter of choice. But as the method advertises throwing only IOException, handling DBException inside the method would avoid mistakes in the future if a developer chooses to call this method and overlooks handling DBException. You are correct: handleException can be changed to static and take one more error code. Will upload a patch with these changes. LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-1984: --- Attachment: YARN-1984.002.patch Made the changes as per review. Kindly review [~jlowe] and [~zjshen] LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2898) Container-executor prints out wrong error information when failed
[ https://issues.apache.org/jira/browse/YARN-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2898: -- Summary: Container-executor prints out wrong error information when failed (was: Container-executor may fail with wrong information) Container-executor prints out wrong error information when failed -- Key: YARN-2898 URL: https://issues.apache.org/jira/browse/YARN-2898 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor *Settings*: a YARN cluster using LinuxContainerExecutor, with banned.users left empty in container-executor.cfg. The default banned list is \{mapred, hdfs, bin\}. *Problem*: when user mapred submits a job, it fails with ExitCodeException exitCode=139, which is a segmentation fault. This is incorrect; the correct information should be "Requested user mapred is banned". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2898) Container-executor prints out wrong error information when failed
[ https://issues.apache.org/jira/browse/YARN-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2898: -- Attachment: YARN-2898-1.patch Container-executor prints out wrong error information when failed -- Key: YARN-2898 URL: https://issues.apache.org/jira/browse/YARN-2898 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2898-1.patch *Settings*: a YARN cluster using LinuxContainerExecutor, with banned.users left empty in container-executor.cfg. The default banned list is \{mapred, hdfs, bin\}. *Problem*: when user mapred submits a job, it fails with ExitCodeException exitCode=139, which is a segmentation fault. This is incorrect; the correct information should be "Requested user mapred is banned". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2898) Container-executor may fail with wrong information
Wei Yan created YARN-2898: - Summary: Container-executor may fail with wrong information Key: YARN-2898 URL: https://issues.apache.org/jira/browse/YARN-2898 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor *Settings*: a YARN cluster using LinuxContainerExecutor, with banned.users left empty in container-executor.cfg. The default banned list is \{mapred, hdfs, bin\}. *Problem*: when user mapred submits a job, it fails with ExitCodeException exitCode=139, which is a segmentation fault. This is incorrect; the correct information should be "Requested user mapred is banned". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2762: - Attachment: YARN-2762.2.patch RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.patch All NodeLabel argument validations are done on the server side. The same can be done in RMAdminCLI so that unnecessary RPC calls can be avoided. And for input such as x,y,,z,, there is no need to add an empty string; it can instead be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2898) Container-executor prints out wrong error information when failed
[ https://issues.apache.org/jira/browse/YARN-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223288#comment-14223288 ] Jason Lowe edited comment on YARN-2898 at 11/24/14 6:52 PM: This is a duplicate of YARN-2847. was (Author: jlowe): This is a duplicate of YAN-2847. Container-executor prints out wrong error information when failed -- Key: YARN-2898 URL: https://issues.apache.org/jira/browse/YARN-2898 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2898-1.patch *Settings*: a YARN cluster using LinuxContainerExecutor, with banned.users left empty in container-executor.cfg. The default banned list is \{mapred, hdfs, bin\}. *Problem*: when user mapred submits a job, it fails with ExitCodeException exitCode=139, which is a segmentation fault. This is incorrect; the correct information should be "Requested user mapred is banned". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223303#comment-14223303 ] Rohith commented on YARN-2762: -- I updated the patch by creating it from a different branch, but I don't see any difference between the two patches. Still, I attached it. Let's wait for Jenkins to run! RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.patch All NodeLabel argument validations are done on the server side. The same can be done in RMAdminCLI so that unnecessary RPC calls can be avoided. And for input such as x,y,,z,, there is no need to add an empty string; it can instead be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2898) Container-executor prints out wrong error information when failed
[ https://issues.apache.org/jira/browse/YARN-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-2898. -- Resolution: Duplicate This is a duplicate of YAN-2847. Container-executor prints out wrong error information when failed -- Key: YARN-2898 URL: https://issues.apache.org/jira/browse/YARN-2898 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2898-1.patch *Settings*: a YARN cluster using LinuxContainerExecutor, with banned.users left empty in container-executor.cfg. The default banned list is \{mapred, hdfs, bin\}. *Problem*: when user mapred submits a job, it fails with ExitCodeException exitCode=139, which is a segmentation fault. This is incorrect; the correct information should be "Requested user mapred is banned". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2897) CrossOriginFilter needs more log statements
[ https://issues.apache.org/jira/browse/YARN-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2897: Affects Version/s: 2.6.0 CrossOriginFilter needs more log statements --- Key: YARN-2897 URL: https://issues.apache.org/jira/browse/YARN-2897 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2897.patch CrossOriginFilter does not log enough to make debugging easy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2898) Container-executor prints out wrong error information when failed
[ https://issues.apache.org/jira/browse/YARN-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223291#comment-14223291 ] Wei Yan commented on YARN-2898: --- Oh, yes. Thanks, [~jlowe]. Container-executor prints out wrong error information when failed -- Key: YARN-2898 URL: https://issues.apache.org/jira/browse/YARN-2898 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2898-1.patch *Settings*: a YARN cluster using LinuxContainerExecutor, with banned.users left empty in container-executor.cfg. The default banned list is \{mapred, hdfs, bin\}. *Problem*: when user mapred submits a job, it fails with ExitCodeException exitCode=139, which is a segmentation fault. This is incorrect; the correct information should be "Requested user mapred is banned". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-1984: --- Attachment: (was: YARN-1984.002.patch) LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2897) CrossOriginFilter needs more log statements
[ https://issues.apache.org/jira/browse/YARN-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2897: Attachment: YARN-2897.patch Attaching the patch CrossOriginFilter needs more log statements --- Key: YARN-2897 URL: https://issues.apache.org/jira/browse/YARN-2897 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2897.patch CrossOriginFilter does not log enough to make debugging easy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-1984: --- Attachment: YARN-1984.002.patch discardOldEntities does not need to handle DBException if deleteNextEntity handles it. Updated the patch with this change. LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
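[Editorial note] A minimal sketch of the wrapping pattern discussed in this update; the method and field names here are illustrative, not the actual patch contents:
{code}
// Sketch only: wrap leveldb's unchecked DBException in a checked IOException
// so it cannot leak up the stack and kill callers such as the deletion thread.
// Assumes: org.iq80.leveldb.DB db; org.iq80.leveldb.DBException.
private void deleteNextEntity(byte[] key) throws IOException {
  try {
    db.delete(key); // may throw the unchecked DBException
  } catch (DBException e) {
    throw new IOException(e);
  }
}

// A caller like discardOldEntities then only needs to handle IOException:
private void discardOldEntities(byte[] key) {
  try {
    deleteNextEntity(key);
  } catch (IOException e) {
    LOG.error("Error discarding old entities", e);
  }
}
{code}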
[jira] [Commented] (YARN-2897) CrossOriginFilter needs more log statements
[ https://issues.apache.org/jira/browse/YARN-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223336#comment-14223336 ] Hadoop QA commented on YARN-2897: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683387/YARN-2897.patch against trunk revision f636f9d. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5922//console This message is automatically generated. CrossOriginFilter needs more log statements --- Key: YARN-2897 URL: https://issues.apache.org/jira/browse/YARN-2897 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2897.patch CrossOriginFilter does not log enough to make debugging easier -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2691: - Attachment: YARN-2691.patch Updated the patch by rebasing it, and also fixed the comment to note that ApplicationPriority implements the Comparable interface. User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch, YARN-2691.patch Support for handling the Application-Priority label coming from the client to ApplicationSubmissionContext. Common API support for users. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223351#comment-14223351 ] Hadoop QA commented on YARN-1984: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683377/YARN-1984.002.patch against trunk revision 555fa2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5919//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5919//console This message is automatically generated. LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2854) The document about timeline service and generic service needs to be updated
[ https://issues.apache.org/jira/browse/YARN-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223363#comment-14223363 ] Zhijie Shen commented on YARN-2854: --- [~Naganarasimha], thanks for taking on this documentation work. I suggest the following updates: 1. If you read through the document, the per-framework data (which is actually the basic timeline service) and the generic data (aka the generic history service) sound like two equal pieces of this daemon. It may be better to promote the timeline service as the first-class citizen of this document, and then explain the generic history service as the built-in payload of it. 2. We need to update the current status section. Up till now, the essential functionality of the timeline server is done, and it can work in both insecure and secure modes. The generic history service already rides on the timeline store. The coming work may be the scalability and reliability of the timeline service. As the target fix version is set to 2.7.0, we may update it around the end of the release. 3. Add a section about enabling security of the timeline server. I can help with this section if necessary. 4. The configurations for the generic history service need to be updated. 5. It may be better to enhance the client example code, at least to show how to create a domain and put an entity into a particular domain. 6. For the API specifications, let's still keep them separate: YARN-1876 The document about timeline service and generic service needs to be updated --- Key: YARN-2854 URL: https://issues.apache.org/jira/browse/YARN-2854 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Naganarasimha G R Priority: Critical Attachments: YARN-2854.20141120-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223364#comment-14223364 ] Hadoop QA commented on YARN-2637: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683370/YARN-2637.9.patch against trunk revision 555fa2d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 13 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5917//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5917//console This message is automatically generated. maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if each AM uses 5M (> minimum_allocation), all apps can still be activated, occupying all resources of the queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223299#comment-14223299 ] Sangjin Lee commented on YARN-2517: --- It might be conceptually cleaner (and simpler for the clients) to have the clients post the events but not deal with what would happen if the result was not successfully sent. Even the very concept of whether the result was sent is problematic. An ATS writer implementation could decide to buffer a certain amount of data before physically writing it to the backing storage (as an optimization). If the client needs to know whether the event was truly sent to the backing storage and also deal with its failure, it may lead to ATS writer implementations leaking to clients and may limit the ways an ATS writer implementation can be optimized, etc. How about a sync write (as it stands now) for critical data and an async write which is basically fire-and-forget? The understanding there would be that the async write is basically best effort, and it would fall on the ATS writer implementation to try to deliver the events to the backing storage as reliably and optimally as it can (but in theory still no guarantees). Then the async write can even be enabled with a boolean flag. Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities from another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
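[Editorial note] A rough sketch of the sync/async split proposed above, assuming a wrapper around the existing TimelineClient#putEntities; the executor, flag, and wrapper names are illustrative:
{code}
// Sketch only: sync put for critical data, best-effort fire-and-forget put
// for everything else, selected by a boolean flag.
private final ExecutorService asyncPool = Executors.newSingleThreadExecutor();

public void putEntities(boolean async, final TimelineEntity... entities)
    throws IOException, YarnException {
  if (!async) {
    timelineClient.putEntities(entities); // sync: the caller sees any failure
    return;
  }
  asyncPool.submit(new Runnable() {
    @Override
    public void run() {
      try {
        timelineClient.putEntities(entities); // best effort, no guarantees
      } catch (Exception e) {
        LOG.warn("Async timeline put failed; dropping entities", e);
      }
    }
  });
}
{code}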
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223369#comment-14223369 ] Eric Payne commented on YARN-1963: -- Hi [~sunilg]. Thanks for the work you are doing on this issue. bq. {{yarn.scheduler.capacity.root.queue_name.priority_label.acl}} If this property doesn't exist, will queue admins still be able to change priorities of jobs in the queue? Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223386#comment-14223386 ] Hadoop QA commented on YARN-2762: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683382/YARN-2762.2.patch against trunk revision f636f9d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5920//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5920//console This message is automatically generated. RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.patch All NodeLabel argument validations are done at the server side. The same can be done in RMAdminCLI so that unnecessary RPC calls can be avoided. And for input such as x,y,,z,, there is no need to add an empty string; it can be skipped instead (see the sketch below). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
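[Editorial note] A small sketch of the client-side cleanup described in this issue, assuming labels arrive as a comma-separated string; the helper name is illustrative:
{code}
// Sketch only: trim each token and skip empty ones, so input like
// "x,y,,z," yields {x, y, z} with no empty-string labels sent to the RM.
private static Set<String> parseNodeLabels(String arg) {
  Set<String> labels = new HashSet<String>();
  for (String token : arg.split(",")) {
    String trimmed = token.trim();
    if (!trimmed.isEmpty()) {
      labels.add(trimmed);
    }
  }
  return labels;
}
{code}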
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223401#comment-14223401 ] Hadoop QA commented on YARN-1984: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683388/YARN-1984.002.patch against trunk revision f636f9d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5921//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5921//console This message is automatically generated. LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223468#comment-14223468 ] Hadoop QA commented on YARN-2691: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683391/YARN-2691.patch against trunk revision 380a361. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1229 javac compiler warnings (more than the trunk's current 1219 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5923//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5923//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5923//console This message is automatically generated. User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch, YARN-2691.patch Support for handling the Application-Priority label coming from the client to ApplicationSubmissionContext. Common API support for users. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2899) Run TestDockerContainerExecutorWithMocks on Linux only
Ming Ma created YARN-2899: - Summary: Run TestDockerContainerExecutorWithMocks on Linux only Key: YARN-2899 URL: https://issues.apache.org/jira/browse/YARN-2899 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Priority: Minor It seems the test should strictly check for Linux; otherwise, it will fail when the OS isn't Linux. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2009) Priority support for preemption in ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223490#comment-14223490 ] Eric Payne commented on YARN-2009: -- If we are to choose the less complicated route, I believe that, at the very least, when {{ProportionalCapacityPreemptionPolicy}} determines that {{queueA}} needs to give up some containers, it should first select containers from the lowest priority apps. Priority support for preemption in ProportionalCapacityPreemptionPolicy --- Key: YARN-2009 URL: https://issues.apache.org/jira/browse/YARN-2009 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Devaraj K Assignee: Sunil G While preempting containers based on the queue ideal assignment, we may need to consider preempting the low priority application containers first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
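[Editorial note] To make the suggested ordering concrete, an illustrative comparator over a queue's candidate applications; this is not code from ProportionalCapacityPreemptionPolicy, and the priority accessor is assumed:
{code}
// Sketch only: order candidate apps so the lowest-priority ones give up
// containers first. Assumes an integer app priority where a larger value
// means higher priority (accessor name getAppPriority is illustrative).
Collections.sort(candidateApps, new Comparator<FiCaSchedulerApp>() {
  @Override
  public int compare(FiCaSchedulerApp a, FiCaSchedulerApp b) {
    // ascending order: the lowest-priority app is first in the preemption list
    return Integer.compare(a.getAppPriority(), b.getAppPriority());
  }
});
{code}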
[jira] [Updated] (YARN-2899) Run TestDockerContainerExecutorWithMocks on Linux only
[ https://issues.apache.org/jira/browse/YARN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-2899: -- Attachment: YARN-2899.patch Run TestDockerContainerExecutorWithMocks on Linux only -- Key: YARN-2899 URL: https://issues.apache.org/jira/browse/YARN-2899 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Priority: Minor Attachments: YARN-2899.patch It seems the test should strictly check for Linux; otherwise, it will fail when the OS isn't Linux. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223511#comment-14223511 ] Jason Lowe commented on YARN-1984: -- +1 latest patch lgtm. [~zjshen] do you have further comments? LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
Jonathan Eagles created YARN-2900: - Summary: Application Not Found in AHS throws Internal Server Error with NPE Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223550#comment-14223550 ] Zhijie Shen commented on YARN-1984: --- LGTM LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223552#comment-14223552 ] Jonathan Eagles commented on YARN-2900: --- Application not found in the history store should be a normal case, not an exceptional one, in the REST API case, since the application id is user-provided information. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2900: -- Description: Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223557#comment-14223557 ] Zhijie Shen commented on YARN-2900: --- [~jeagles], it seems that you're still using ApplicationHistoryManagerImpl and the old application history store. {code} org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) {code} Otherwise, you should see ApplicationHistoryManagerOnTimelineStore instead. Per discussion in [YARN-2900|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], we will no longer support the old storage stack. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223562#comment-14223562 ] Zhijie Shen commented on YARN-2900: --- BTW, it's a known issue, and I've filed a ticket before: YARN-1835 Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223557#comment-14223557 ] Zhijie Shen edited comment on YARN-2900 at 11/24/14 9:38 PM: - [~jeagles], it seems that you're still using ApplicationHistoryManagerImpl and the old application history store. {code} org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) {code} Otherwise, you should see ApplicationHistoryManagerOnTimelineStore instead. Per discussion in [YARN-2033|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], we will no longer support the old storage stack. was (Author: zjshen): [~jeagles], it seems that you're still using ApplicationHistoryManagerImpl and the old application history store. {code} org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) {code} Otherwise, you should see ApplicationHistoryManagerOnTimelineStore instead. Per discussion in [YARN-2900|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], we will no longer support the old storage stack. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2188) Client service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2188: --- Attachment: YARN-2188-trunk-v5.patch [~kasha] v5 attached. Link to a diff between v4 and v5: https://github.com/ctrezzo/hadoop/commit/99b80eba32af42d8032fa47e58e4c1068f2707e4 Thanks! Client service for cache manager Key: YARN-2188 URL: https://issues.apache.org/jira/browse/YARN-2188 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2188-trunk-v1.patch, YARN-2188-trunk-v2.patch, YARN-2188-trunk-v3.patch, YARN-2188-trunk-v4.patch, YARN-2188-trunk-v5.patch Implement the client service for the shared cache manager. This service is responsible for handling client requests to use and release resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2900: Attachment: YARN-2900.patch Attaching the patch that checks for null and returns an appropriate result so that the NotFoundException can be thrown in WebServices.java Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223621#comment-14223621 ] Jonathan Eagles commented on YARN-2900: --- [~zjshen], please don't jump to any conclusions. This is my setup, which I believe is a supported configuration for 2.6.0.
{quote}
yarn.timeline-service.generic-application-history.enabled=false
yarn.timeline-service.generic-application-history.store-class=org.apache.hadoop.yarn.server.applicationhistoryservice.NullApplicationHistoryStore
{quote}
The Tez UI makes applicationhistory REST API calls to gather fine details for those who have it enabled. In my case, where generic history is disabled, it is causing massive flooding of the log files. As for not finding the duplicate JIRA, I was unable to find this issue in the search. Try to include details that are searchable (stack trace, logs, class/file names) so that users are able to find the appropriate issue. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223624#comment-14223624 ] Hadoop QA commented on YARN-2900: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683426/YARN-2900.patch against trunk revision 2967c17. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5926//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5926//console This message is automatically generated. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2188) Client service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223632#comment-14223632 ] Hadoop QA commented on YARN-2188: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683425/YARN-2188-trunk-v5.patch against trunk revision 2967c17. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5925//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5925//console This message is automatically generated. Client service for cache manager Key: YARN-2188 URL: https://issues.apache.org/jira/browse/YARN-2188 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2188-trunk-v1.patch, YARN-2188-trunk-v2.patch, YARN-2188-trunk-v3.patch, YARN-2188-trunk-v4.patch, YARN-2188-trunk-v5.patch Implement the client service for the shared cache manager. This service is responsible for handling client requests to use and release resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223671#comment-14223671 ] Jonathan Eagles commented on YARN-2900: --- Now that I am looking at the code, I do see something suspicious in the log file. 2014-11-24 22:12:42,107 [main] WARN applicationhistoryservice.ApplicationHistoryServer: The filesystem based application history store is deprecated. Looking into this. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223676#comment-14223676 ] Jonathan Eagles commented on YARN-2900: --- Issue is spacing in the config file. Here is the updated stack trace. {quote} 2014-11-24 22:34:53,900 [17694135@qtp-11347161-6] WARN webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: The entity for application application_1416586084624_0011 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:452) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:227) at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApp(AHSWebServices.java:95) Caused by: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: The entity for application application_1416586084624_0011 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplication(ApplicationHistoryManagerOnTimelineStore.java:542) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplication(ApplicationHistoryManagerOnTimelineStore.java:94) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more {quote} Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223679#comment-14223679 ] Zhijie Shen commented on YARN-2900: --- bq. This is my setup, which I believe is a supported configuration for 2.6.0. Yeah, the configuration should be supported. However, with the configuration, ApplicationHistoryManagerOnTimelineStore should be used instead. Here's the related code in ApplicationHistoryServer.
{code}
private ApplicationHistoryManager createApplicationHistoryManager(
    Configuration conf) {
  // Backward compatibility:
  // APPLICATION_HISTORY_STORE is neither null nor empty, it means that the
  // user has enabled it explicitly.
  if (conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE) == null
      || conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE).length() == 0
      || conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE).equals(
          NullApplicationHistoryStore.class.getName())) {
    return new ApplicationHistoryManagerOnTimelineStore(
        timelineDataManager, aclsManager);
  } else {
    LOG.warn("The filesystem based application history store is deprecated.");
    return new ApplicationHistoryManagerImpl();
  }
}
{code}
I tested this config locally. It seems that the new ApplicationHistoryManagerOnTimelineStore was picked. If it doesn't pick the right manager, then it is really bad. But given ApplicationHistoryManagerOnTimelineStore is picked, we shouldn't see the NPE exception in the description. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1984) LeveldbTimelineStore does not handle db exceptions properly
[ https://issues.apache.org/jira/browse/YARN-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223689#comment-14223689 ] Hudson commented on YARN-1984: -- FAILURE: Integrated in Hadoop-trunk-Commit #6597 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6597/]) YARN-1984. LeveldbTimelineStore does not handle db exceptions properly. Contributed by Varun Saxena (jlowe: rev 1ce4d33c2dc86d711b227a04d2f9a2ab696a24a1) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java LeveldbTimelineStore does not handle db exceptions properly --- Key: YARN-1984 URL: https://issues.apache.org/jira/browse/YARN-1984 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-1984.001.patch, YARN-1984.002.patch, YARN-1984.patch The org.iq80.leveldb.DB and DBIterator methods throw runtime exceptions rather than IOException which can easily leak up the stack and kill threads (e.g.: the deletion thread). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223691#comment-14223691 ] Zhijie Shen commented on YARN-2900: --- bq. Issue is spacing in the config file. Here is the updated stack trace. Then, this log message was expected when the app is not found. But INTERNAL_SERVER_ERROR is bad. We should return NOT_FOUND instead. We can capture NotFoundException and convert it to webapp.NotFoundException. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
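[Editorial note] A minimal sketch of the conversion suggested above, as it might look in the WebServices layer; the placement and surrounding code are illustrative:
{code}
// Sketch only: turn the store-level not-found case into the webapp's
// NotFoundException so the REST layer answers 404 rather than 500.
try {
  app = appHistoryManager.getApplication(appId);
} catch (ApplicationNotFoundException e) {
  throw new org.apache.hadoop.yarn.webapp.NotFoundException(e.getMessage());
}
{code}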
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223694#comment-14223694 ] Jonathan Eagles commented on YARN-2900: --- FYI: Here is the config that was causing the original failure. Notice the newline as part of the value.
{quote}
<property>
  <description>Store class name for history store, defaulting to file system store</description>
  <name>yarn.timeline-service.generic-application-history.store-class</name>
  <value>org.apache.hadoop.yarn.server.applicationhistoryservice.NullApplicationHistoryStore
  </value>
</property>
{quote}
The Internal Server Error still happens with ApplicationHistoryManagerOnTimelineStore, which this issue now tracks. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Labels: metrics supportability (was: ) Add metric for container launch duration Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2679.000.patch, YARN-2679.001.patch, YARN-2679.002.patch add a metric in NodeManagerMetrics to capture the time to prepare and launch a container. The prepare time is the duration between sending the ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.
[ https://issues.apache.org/jira/browse/YARN-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2675: Labels: metrics supportability (was: ) the containersKilled metrics is not updated when the container is killed during localization. - Key: YARN-2675 URL: https://issues.apache.org/jira/browse/YARN-2675 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-2675.000.patch, YARN-2675.001.patch, YARN-2675.002.patch, YARN-2675.003.patch, YARN-2675.004.patch, YARN-2675.005.patch The containersKilled metric is not updated when the container is killed during localization. We should handle the KILLING state in ContainerImpl#finished() so that killedContainer is updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
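Roughly, the fix amounts to covering the KILLING case in the metrics switch; a simplified sketch under the assumption that ContainerImpl#finished() dispatches on the current state (not the exact patch):
{code}
// Illustrative sketch: account for containers killed while still localizing.
switch (getContainerState()) {
  case EXITED_WITH_SUCCESS:
    metrics.completedContainer();
    break;
  case EXITED_WITH_FAILURE:
    metrics.failedContainer();
    break;
  case KILLING:   // previously fell through without updating the killed count
    metrics.killedContainer();
    break;
}
{code}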
[jira] [Updated] (YARN-2802) ClusterMetrics to include AM launch and register delays
[ https://issues.apache.org/jira/browse/YARN-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2802: Labels: metrics supportability (was: ) ClusterMetrics to include AM launch and register delays --- Key: YARN-2802 URL: https://issues.apache.org/jira/browse/YARN-2802 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Fix For: 2.7.0 Attachments: YARN-2802.000.patch, YARN-2802.001.patch, YARN-2802.002.patch, YARN-2802.003.patch, YARN-2802.004.patch, YARN-2802.005.patch Add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issues. Two metrics were added: aMLaunchDelay, the time from sending the AMLauncherEventType.LAUNCH event to receiving the RMAppAttemptEventType.LAUNCHED event in RMAppAttemptImpl; and aMRegisterDelay, the time from receiving the RMAppAttemptEventType.LAUNCHED event to receiving the RMAppAttemptEventType.REGISTERED event (ApplicationMasterService#registerApplicationMaster) in RMAppAttemptImpl. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223734#comment-14223734 ] Mit Desai commented on YARN-2900: - bq. We can capture NotFoundException and convert it to webapp.NotFoundException [~zjshen] How about we just return null if the entity == null in ApplicationHistoryManagerOnTimelineStore? That way, when the call returns to WebServices, it will throw NotFoundException from its current implementation. This is the exact same approach used in the patch that I submitted earlier. Application Not Found in AHS throws Internal Server Error with NPE -- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2697) RMAuthenticationHandler is no longer useful
[ https://issues.apache.org/jira/browse/YARN-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223735#comment-14223735 ] Zhijie Shen commented on YARN-2697: --- +1. Remove useless code path. It should be okay with tests. Will commit the patch RMAuthenticationHandler is no longer useful --- Key: YARN-2697 URL: https://issues.apache.org/jira/browse/YARN-2697 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: haosdent Attachments: YARN-2697.patch After YARN-2656, RMAuthenticationHandler is no longer useful, because authentication mechanism is reusing the common DT auth filter stack. It should be safe to remove this unused code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2697) RMAuthenticationHandler is no longer useful
[ https://issues.apache.org/jira/browse/YARN-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223748#comment-14223748 ] Hudson commented on YARN-2697: -- FAILURE: Integrated in Hadoop-trunk-Commit #6598 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6598/]) YARN-2697. Remove useless RMAuthenticationHandler. Contributed by Haosong Huang. (zjshen: rev e37a4ff0c1712a1cb80e0412ec53a5d10b8d30f9) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java RMAuthenticationHandler is no longer useful --- Key: YARN-2697 URL: https://issues.apache.org/jira/browse/YARN-2697 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: haosdent Attachments: YARN-2697.patch After YARN-2656, RMAuthenticationHandler is no longer useful, because authentication mechanism is reusing the common DT auth filter stack. It should be safe to remove this unused code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application Not Found in AHS throws Internal Server Error with NPE
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223767#comment-14223767 ] Zhijie Shen commented on YARN-2900: --- bq. How about we just return null if the entity == null in ApplicationHistoryManagerOnTimelineStore? This is fine, too. It should also benefit the web UI. But please make sure that in ApplicationHistoryClientService, if the report == null, we throw the corresponding XxxxNotFoundException. The initial consideration for not returning null but throwing XXXNotFoundException was to be consistent with ClientRMService. Thanks! Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
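Putting the two halves of the agreement together, a sketch (class placement and messages are assumptions, not the final patch):
{code}
// In ApplicationHistoryManagerOnTimelineStore: a missing timeline entity
// yields null rather than an NPE downstream.
if (entity == null) {
  return null;
}

// In ApplicationHistoryClientService: a null report becomes the typed
// exception, mirroring ClientRMService's behavior.
ApplicationReport report = history.getApplication(appId);
if (report == null) {
  throw new ApplicationNotFoundException(
      "Application " + appId + " is not found in the history store");
}
{code}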
[jira] [Updated] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2900: -- Summary: Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) (was: Application Not Found in AHS throws Internal Server Error with NPE) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223772#comment-14223772 ] Swapnil Daingade commented on YARN-2139: +1 for having an abstract policy to wrap spindles / disk affinity / iops / bandwidth, etc. [Umbrella] Support for Disk as a Resource in YARN -- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, YARN-2139-prototype.patch YARN should consider disk as another resource for (1) scheduling tasks on nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
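Purely as illustration of what such an abstract policy could look like (every name below is hypothetical; none of this comes from the posted design documents):
{code}
// Hypothetical: one possible shape for a pluggable disk-resource policy.
public interface DiskResourcePolicy {
  /** Fold spindles / iops / bandwidth into one schedulable quantity. */
  int toSchedulableUnits(int spindles, long iops, long bandwidthMBps);

  /** Whether a request fits on a node, given units already in use. */
  boolean fits(int requestedUnits, int usedUnits, int nodeCapacityUnits);
}
{code}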
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223781#comment-14223781 ] Mit Desai commented on YARN-2900: - Thanks! I will post the updated patch following the discussion here. You can review it once it's up. Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1996: -- Attachment: YARN-1996-2.patch [~jira.shegalov], [~maysamyabandeh] and I identified the root cause of https://issues.apache.org/jira/browse/MAPREDUCE-6043 and came up with the updated patch to address that scenario. MRAppMaster's RMContainerAllocator depends on RM's CompletedContainers messages to make allocation requests. In some corner cases when the node becomes unhealthy, CompletedContainers messages might be lost. The new patch makes sure RM will deliver CompletedContainers messages to the AM in the following scenarios:
* NM delivers unhealthy and completed-containers notifications to RM in the same heartbeat.
* NM becomes unhealthy first, then it restarts.
* NM becomes unhealthy first, then it becomes healthy.
* NM becomes unhealthy first, then RM asks it to reboot.
* NM becomes unhealthy first, then it is decommissioned.
* NM becomes unhealthy first, then RM loses it.
For work-preserving RM restart, an unhealthy NM will first be transitioned to RUNNING state after RM restart, and then to UNHEALTHY state. So if the RM restarts while it is draining unhealthy nodes, it should be able to continue draining them after the restart. Appreciate any input on this. Provide alternative policies for UNHEALTHY nodes. - Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996-2.patch, YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain), whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of an NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example:
{code}
if [ -e "$1" ]; then
  echo "ERROR Node decommissioning via health script hack"
fi
{code}
In the current version of the patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unhealthy.drain.containers}}. More versatile policies are possible in future work. Currently, the health state of a node is binary, determined from the disk checker and the health script's ERROR output. However, we could also interpret health script output like Java logging levels (one of which is ERROR), adding levels such as WARN and FATAL. Each level can then be treated differently, e.g.:
- FATAL: unusable, as today
- ERROR: drain
- WARN: halve the node capacity
complemented with equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
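For reference, enabling the drain behavior would look like the following in yarn-site.xml; the property name comes from the patch description above, while the snippet itself is illustrative:
{code}
<!-- Sketch: let running containers finish on an UNHEALTHY node instead of
     killing them; new containers are still kept off the node. -->
<property>
  <name>yarn.nodemanager.unhealthy.drain.containers</name>
  <value>true</value>
</property>
{code}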
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223827#comment-14223827 ] Sunil G commented on YARN-1963: --- Thank you Wangda and [~eepayne] for the comments. The ACL, if configured for a queue, will be considered before the job is submitted. If there is no such configuration, only the queue ACL will be checked, which is the same as what happens now. The priority-label-level ACL sits on top of the queue-level ACL; it is an extra check that the admin can configure as needed. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2009) Priority support for preemption in ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223860#comment-14223860 ] Sunil G commented on YARN-2009: --- Thank you [~curino] for the thoughts. I understand that preemption is already hard for users to reason about: when a few containers are preempted, it can be tough for them to understand why those particular containers were chosen. Even now, when only the timestamp is considered (and, to an extent, the user-limit factor), that is hard to express through logs. Hence, solving some of the small imbalances I mentioned may not help users much at a high level; based on use cases, we can check later whether those are needed. Coming to the focus of this JIRA: within a queue, if slow, low-priority applications are running and consuming all the resources, it would be good to make some space by preempting the lower-priority ones. This preemption can be done within a queue. We have seen lower-priority applications take more of the cluster while higher-priority applications wait a long time to launch. Please share your thoughts on this. Priority support for preemption in ProportionalCapacityPreemptionPolicy --- Key: YARN-2009 URL: https://issues.apache.org/jira/browse/YARN-2009 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Devaraj K Assignee: Sunil G While preempting containers based on the queue ideal assignment, we may need to consider preempting the low priority application containers first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2774) shared cache service should authorize calls properly
[ https://issues.apache.org/jira/browse/YARN-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2774: --- Description: The shared cache manager (SCM) services should authorize calls properly. Currently, the uploader service (done in YARN-2186) does not authorize calls to notify the SCM on newly uploaded resource. Proper security/authorization needs to be done in this RPC call. Also, the use/release calls (done in YARN-2188) and the scmAdmin commands (done in YARN-2189) are not properly authorized. was: The shared cache manager (SCM) services should authorize calls properly. Currently, the uploader service (done in YARN-2186) does not authorize calls to notify the SCM on newly uploaded resource. Proper security/authorization needs to be done in this RPC call. Also, the use/release calls (done in YARN-2188) are not properly authorized. shared cache service should authorize calls properly Key: YARN-2774 URL: https://issues.apache.org/jira/browse/YARN-2774 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sangjin Lee The shared cache manager (SCM) services should authorize calls properly. Currently, the uploader service (done in YARN-2186) does not authorize calls to notify the SCM on newly uploaded resource. Proper security/authorization needs to be done in this RPC call. Also, the use/release calls (done in YARN-2188) and the scmAdmin commands (done in YARN-2189) are not properly authorized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223917#comment-14223917 ] Hadoop QA commented on YARN-1996: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683446/YARN-1996-2.patch against trunk revision 8caf537. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1223 javac compiler warnings (more than the trunk's current 1219 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5927//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5927//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5927//console This message is automatically generated. Provide alternative policies for UNHEALTHY nodes. - Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996-2.patch, YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain), whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of an NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example:
{code}
if [ -e "$1" ]; then
  echo "ERROR Node decommissioning via health script hack"
fi
{code}
In the current version of the patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unhealthy.drain.containers}}. More versatile policies are possible in future work. Currently, the health state of a node is binary, determined from the disk checker and the health script's ERROR output.
However, we could also interpret health script output like Java logging levels (one of which is ERROR), adding levels such as WARN and FATAL. Each level can then be treated differently, e.g.:
- FATAL: unusable, as today
- ERROR: drain
- WARN: halve the node capacity
complemented with equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223923#comment-14223923 ] Zhijie Shen commented on YARN-2517: --- Thanks for sharing your ideas, Hitesh, Mit and Sangjin! I'd like to make some clarifications. The error that the handler wants to take care of is not a communication problem, but a problem that happens when the server is processing the posted timeline entity (see TimelinePutResponse). It could be a data-integrity issue introduced by the app. For example, the posted Entity A in Domain 1 is trying to relate to Entity B in Domain 2. It's fine if, in some use cases, the app doesn't care about it and consequently doesn't require an ack; the app can just go ahead without providing the handler. However, it's still better to be generic enough to cover the other use cases where the app wants to make sure its timeline data is persisted, or at least know whether the put succeeded or not. A queueing/messaging layer may help mitigate the communication problem, but the aforementioned data problem could still happen, such that the app may still want to hear about the put response. However, it sounds right that the implementation of this layer will affect that of the async call. It makes sense to wait until we make it clear how to make client-to-timeline-server communication reliable. If we eventually find that a handler in the async call is difficult and would further prevent optimization, a sync write (as it stands now) for critical data plus an async write that is basically fire-and-forget sounds like a reasonable alternative plan. Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have a callback to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
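A hedged sketch of the alternative plan in the last paragraph, i.e. synchronous puts for critical data and fire-and-forget async puts otherwise; the executor wiring is an assumption, and only TimelineClient#putEntities is the real API:
{code}
// Critical data: synchronous put; the caller inspects TimelinePutResponse
// itself for server-side errors (e.g. data-integrity problems).
TimelinePutResponse response = timelineClient.putEntities(entity);

// Non-critical data: fire-and-forget on a separate thread; failures are
// only logged and no ack is surfaced to the app.
executor.submit(new Runnable() {
  @Override
  public void run() {
    try {
      timelineClient.putEntities(entity);
    } catch (Exception e) {
      LOG.warn("Dropping timeline entity", e);
    }
  }
});
{code}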
[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()
[ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223993#comment-14223993 ] Tsuyoshi OZAWA commented on YARN-2025: -- Thanks for your point, [~rohithsharma]. I'll take a look. Possible NPE in schedulers#addApplicationAttempt() -- Key: YARN-2025 URL: https://issues.apache.org/jira/browse/YARN-2025 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2025.1.patch In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we don't check whether {{application}} is null. This can cause an NPE in the following sequence: addApplication() -> doneApplication() (e.g. AppKilledTransition) -> addApplicationAttempt().
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
String user = application.getUser();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
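The straightforward guard would look something like this (the log message and early return are illustrative, not the posted patch):
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  // The app was already removed, e.g. by AppKilledTransition racing ahead.
  LOG.warn("Ignoring attempt " + applicationAttemptId
      + " for unknown application");
  return;
}
String user = application.getUser();
{code}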
[jira] [Commented] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224012#comment-14224012 ] Jian He commented on YARN-2404: --- Looks good; one minor thing: we could just return after checking attemptTokens == null:
{code}
if (attemptTokens == null) {
  builder.clearAppAttemptTokens();
}
{code}
Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch We can remove the ApplicationState and ApplicationAttemptState classes in RMStateStore, given that we already have the ApplicationStateData and ApplicationAttemptStateData records. We may just replace ApplicationState with ApplicationStateData, and similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
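That is, turn the check into a guard clause so the rest of the method only deals with the non-null case; a sketch of the suggestion (surrounding method body assumed):
{code}
if (attemptTokens == null) {
  builder.clearAppAttemptTokens();
  return;   // nothing further to serialize for this attempt
}
// ... continue converting attemptTokens into the protobuf field ...
{code}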
[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2637: -- Attachment: YARN-2637.12.patch Added a test specific to the changed behavior; all existing tests should still pass. This patch should be ready for review. maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() +
        " from user: " + application.getUser() +
        " activated in queue: " + getQueueName());
  }
}
{code}
For example, if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, up to 200 AMs can be launched; if each AM actually uses 5M (> minimum_allocation), all apps can still be activated, and AMs can occupy all the resources of a queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
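A quick worked instance of the violation, using the numbers from the description (rounded for readability):
{code}
queue_max_capacity          = 1024 MB
maximum_am_resource_percent = 0.2     -> intended AM budget ~ 205 MB
minimum_allocation          = 1 MB    -> #max_am_number ~ 205 activations allowed
actual AM size              = 5 MB    -> 205 AMs * 5 MB ~ 1025 MB, i.e. the
                                         whole queue, not 20% of it
{code}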
[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()
[ https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224028#comment-14224028 ] Jian He commented on YARN-2025: --- bq. This looks very strange that how can RMApp-FAILED but RMAppAttempt-null..? YARN-2834 should fix this. [~rohithsharma], are you running a build with the patch or without? Possible NPE in schedulers#addApplicationAttempt() -- Key: YARN-2025 URL: https://issues.apache.org/jira/browse/YARN-2025 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2025.1.patch In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we don't check whether {{application}} is null. This can cause an NPE in the following sequence: addApplication() -> doneApplication() (e.g. AppKilledTransition) -> addApplicationAttempt().
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
String user = application.getUser();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224034#comment-14224034 ] Craig Welch commented on YARN-2637: --- One open question still in my mind is whether or not the configuration parameter should be changed to actually behave as a percent. Other things so named (userlimit, at least) are actually a percentage - and the name of this parameter tends to suggest that - but it is actually just a float value (so you would use .1 to limit to 10 percent of cluster resource, not 10...). I did take a pass at making the change; it looks doable (with quite a few more test changes...). On the one hand, this seems like the time to make the change, as the meaning of the value is already changing considerably. On the other hand, it may be more impact than we want: users who have configured, say, .3 will still see about the same behavior on a sizable cluster with the change as it stands now, but if we modify the value to actually behave as a percent (i.e. divide by 100), it will become far more limiting unless users adjust their configuration. Thoughts? Myself, I can see arguments both ways, though I'm leaning toward making the change to remove the surprise factor of how this parameter works (i.e. make it a proper % value). maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() +
        " from user: " + application.getUser() +
        " activated in queue: " + getQueueName());
  }
}
{code}
For example, if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, up to 200 AMs can be launched; if each AM actually uses 5M (> minimum_allocation), all apps can still be activated, and AMs can occupy all the resources of a queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
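To make the compatibility trade-off concrete (illustrative numbers only; the property name is the real CapacityScheduler key):
{code}
# Today the value is a raw fraction of the queue's resources:
#   yarn.scheduler.capacity.maximum-am-resource-percent = 0.3  -> 30% for AMs
#
# If reinterpreted as a true percent (value / 100):
#   existing config 0.3 -> 0.3% for AMs (about 100x more restrictive)
#   users would need to rewrite it as 30 -> 30% for AMs
{code}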