[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995996#comment-13995996 ] Maysam Yabandeh commented on YARN-1969: --- Talked to [~jira.shegalov] offline. This would indeed allow the RM to make more efficient scheduling decisions. This seems to be a good candidate for phase 2 of this JIRA. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling*, however, they have a low priority since there are other jobs (usually much smaller newcomers) that are using resources well below their fair share, hence newly released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling policy that offers the resource to the big job, since the sooner the big job finishes, the sooner it releases its many allocated resources for use by other jobs. In other words, what we require is a variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and the estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, its scheduling priority would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource request messages. To be less susceptible to apps gaming the system, we can limit this scheduling to *only within a queue*: i.e., add an EarliestDeadlinePolicy that extends SchedulingPolicy and let queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
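A rough illustration of the kind of priority function p(MEM, TIME) described above. This is only a sketch with hypothetical types: a real implementation would extend the Fair Scheduler's SchedulingPolicy and compare Schedulables, and the particular choice of p here (memory freed per unit of remaining time) is just one possibility.
{code}
import java.util.Comparator;

public class EarliestEndtimeFirstSketch {

  /** Hypothetical per-app view of allocated memory and estimated time to finish. */
  static final class AppUsage {
    final long allocatedMemMB;    // MEM: memory currently held by the app
    final long estimatedFinishMs; // TIME: AM-estimated remaining runtime
    AppUsage(long allocatedMemMB, long estimatedFinishMs) {
      this.allocatedMemMB = allocatedMemMB;
      this.estimatedFinishMs = estimatedFinishMs;
    }
    // One possible p(MEM, TIME): resources freed per unit of remaining time.
    double priority() {
      return (double) allocatedMemMB / Math.max(1L, estimatedFinishMs);
    }
  }

  /** Apps holding many resources and close to finishing sort first. */
  static final Comparator<AppUsage> EARLIEST_ENDTIME_FIRST = new Comparator<AppUsage>() {
    @Override
    public int compare(AppUsage a, AppUsage b) {
      return Double.compare(b.priority(), a.priority());
    }
  };
}
{code}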
[jira] [Created] (YARN-2049) Delegation token stuff for the timeline server
Zhijie Shen created YARN-2049: - Summary: Delegation token stuff for the timeline server Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-1408: - Assignee: Sunil G Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queues = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Submit a big jobA to queue a which uses the full cluster capacity Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity A jobA task that is using queue b's capacity is preempted and killed. This caused the following problem: 1. A new container got allocated for jobA in queue A as per a node update from an NM. 2. This container was immediately preempted by the preemption policy. The ACQUIRED at KILLED invalid state exception occurred when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to time out after 30 minutes, since the container had already been killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
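The two preemption properties in the report can also be set programmatically. This is only a minimal sketch: the property names and policy class are taken verbatim from the description above, and in a real cluster they would normally live in yarn-site.xml rather than in code.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PreemptionConfigSketch {
  public static YarnConfiguration preemptionEnabledConf() {
    YarnConfiguration conf = new YarnConfiguration();
    // Enable the scheduler monitor and plug in the capacity preemption policy,
    // exactly the two settings listed in the report.
    conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
    conf.set("yarn.resourcemanager.scheduler.monitor.policies",
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
            + "ProportionalCapacityPreemptionPolicy");
    return conf;
  }
}
{code}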
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996083#comment-13996083 ] Min Zhou commented on YARN-2048: [~zjshen] hmm... from your patch, it's indeed a duplicate. Please go ahead. List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995819#comment-13995819 ] Jian He commented on YARN-1368: --- Hi [~adhoot], thanks for working on the FS changes, can you please separate it and upload onto YARN-1370 for FS specific change? I already have a local patch which changes quite a bit from the latest patch uploaded here. And also YARN-2017 is likely to go in first which again conflicts quite a bit with the patch here. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2032) Implement a scalable, available TimelineStore using HBase
Vinod Kumar Vavilapalli created YARN-2032: - Summary: Implement a scalable, available TimelineStore using HBase Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Attachment: YARN-1936.1.patch I created a patch: 1. It makes use of the hadoop-auth module and YARN-2049 to talk to the timeline server with either kerberos authentication or a delegation token. 2. It adds a main method, which allows users to upload timeline data from a JSON file via the command line. 3. When using YarnClient to submit an application, if authentication is enabled, YarnClient is going to check whether the app submission context has the timeline DT or not. If not, it will add the DT to the context, such that when the AM uses TimelineClient, it can use the DT for authentication, since it cannot use kerberos there. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996165#comment-13996165 ] Wangda Tan commented on YARN-2017: -- Hi Jian, Thanks for your efforts on this patch, some comments: 1) SchedulerNode.java {code}
private synchronized void deductAvailableResource(Resource resource) {
  if (resource == null) {
    LOG.error("Invalid deduction of null resource for " + rmNode.getNodeAddress());
{code} Since this is the original logic of SchedulerNode, I think it's better to throw an exception instead of just logging an error here. A null object passed in should be considered a serious problem in the scheduler. The same applies to several following places in this class. 2) SchedulerNode.java {code}
+  private synchronized boolean isValidContainer(Container c) {
+    if (launchedContainers.containsKey(c.getId()))
+      return true;
+    return false;
+  }
{code} Better to add braces {...} after the if. 3) SchedulerNode.java {code}
+  public synchronized RMContainer getReservedContainer() {
+    return reservedContainer;
+  }
{code} I think it's better to add a setReservedContainer(...) instead of manipulating super.reservedContainer in its subclasses, and to change the protected reservedContainer to private. 4) In YarnScheduler.java {code}
+  /**
+   * Get the whole resource capacity of the cluster.
+   * @return the whole resource capacity of the cluster.
+   */
+  @LimitedPrivate("yarn")
+  @Unstable
+  public Resource getClusterResource();
{code} I'm wondering whether it is worthwhile to merge this method; it causes too many code changes, and I found no common logic (like SchedulerNode/SchedulerAppAttempt) that uses it. 5) In FairScheduler.java {code}
+  protected FSSchedulerApp getCurrentAttemptForContainer(ContainerId containerId) {
+    return (FSSchedulerApp) super.getCurrentAttemptForContainer(containerId);
+  }
{code} I understand this is an adapter; I agree with [~sandyr] about using generics to eliminate such type casting. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
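On point 5, one way to read the "use generics" suggestion is sketched below. The class names are illustrative placeholders, not the actual YARN scheduler classes: the idea is simply that a base scheduler parameterized on the concrete attempt type lets subclasses return their own type without downcasts.
{code}
// Hypothetical names, for illustration only.
abstract class BaseAppAttempt { }

class FSAppAttempt extends BaseAppAttempt { }

abstract class BaseScheduler<A extends BaseAppAttempt> {
  protected final java.util.Map<String, A> attemptsByContainer =
      new java.util.HashMap<String, A>();

  // Returns the subclass's own attempt type; no (FSSchedulerApp)-style cast needed.
  public A getCurrentAttemptForContainer(String containerId) {
    return attemptsByContainer.get(containerId);
  }
}

class FairSchedulerSketch extends BaseScheduler<FSAppAttempt> {
  // Inherits getCurrentAttemptForContainer(...) returning FSAppAttempt directly.
}
{code}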
[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1902: -- Target Version/s: 2.5.0 (was: 2.3.0) Labels: client (was: patch) Allocation of too many containers when a second request is done with the same resource capability - Key: YARN-1902 URL: https://issues.apache.org/jira/browse/YARN-1902 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Sietse T. Au Labels: client Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers are indeed requested in both scenarios, but only in the second scenario is the correct behavior observed. Looking at the implementation, I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of the Map<Resource, ResourceRequestInfo> structure is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.2#6252)
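For readers unfamiliar with the client API, the sketch below walks through scenario 1 against the AMRMClient interface. It compresses the AM's normal heartbeat loop into two allocate calls and uses made-up capability, priority, and z values; it only illustrates where the extra container described above would show up.
{code}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class DuplicateAllocationSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new Configuration());
    amClient.start();
    amClient.registerApplicationMaster("localhost", 0, "");

    // ContainerRequest x with Resource y; the numbers are arbitrary.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    int z = 3;
    for (int i = 0; i < z; i++) {
      amClient.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }
    AllocateResponse first = amClient.allocate(0.1f);
    List<Container> initial = first.getAllocatedContainers();
    System.out.println("first round: " + initial.size() + " containers");
    // ... launch at least one of the z containers here (scenario 1) ...

    // One more request with the same capability: per the report, the next
    // allocate can return z+1 containers where only 1 is expected.
    amClient.addContainerRequest(
        new ContainerRequest(capability, null, null, priority));
    AllocateResponse second = amClient.allocate(0.2f);
    System.out.println("expected 1, got "
        + second.getAllocatedContainers().size());
  }
}
{code}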
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996055#comment-13996055 ] Zhijie Shen commented on YARN-2048: --- Is this duplicate of YARN-1809? In YARN-1809, one change is to make RM web list containers as well. List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993762#comment-13993762 ] Jason Lowe commented on YARN-1515: -- I apologize for the long delay in reviewing and resulting upmerge it caused. Patch looks good to me with just some minor comments: - StopContainerRequest#getDumpThreads#getDumpThreads should have javadocs and interface annotations like the other methods - Why is StopContainersRequest#getStopRequests marked Unstable but setStopRequests is Stable? - Nit: dumpThreads is an event-specific field, would be nice to have an AMLauncherCleanupEvent that takes just the app attempt in the constructor and derives from AMLauncherEvent. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: New Feature Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-322) Add cpu information to queue metrics
[ https://issues.apache.org/jira/browse/YARN-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995093#comment-13995093 ] Nathan Roberts commented on YARN-322: - Arun, does this patch address what you were looking for? Happy to adjust if not. Add cpu information to queue metrics Key: YARN-322 URL: https://issues.apache.org/jira/browse/YARN-322 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, scheduler Affects Versions: 2.4.0 Reporter: Arun C Murthy Assignee: Nathan Roberts Fix For: 2.5.0 Attachments: YARN-322.patch, YARN-322.patch Post YARN-2 we need to add cpu information to queue metrics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995824#comment-13995824 ] Wangda Tan commented on YARN-2048: -- Hi [~coderplay], +1 for this idea, it should be very helpful for debugging YARN applications. I took a look at your patch; some comments: 1) When the AM is restarted, all containers will be copied to the new attempt's container list, which might confuse users about why the new attempt has all the containers from the old attempts. 2) You might need to consider the following JIRAs to make the app-containers page include all expected containers: YARN-556, YARN-1885, YARN-1489. List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995826#comment-13995826 ] Hadoop QA commented on YARN-2017: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644502/YARN-2017.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3740//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3740//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3740//console This message is automatically generated. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996101#comment-13996101 ] Karthik Kambatla commented on YARN-1969: I am a little confused. Is the original intention of the JIRA to do Earliest Endtime First, and *not* Earliest Deadline First? Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling*, however, they have a low priority since there are other jobs (usually much smaller newcomers) that are using resources well below their fair share, hence newly released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling policy that offers the resource to the big job, since the sooner the big job finishes, the sooner it releases its many allocated resources for use by other jobs. In other words, what we require is a variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and the estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, its scheduling priority would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource request messages. To be less susceptible to apps gaming the system, we can limit this scheduling to *only within a queue*: i.e., add an EarliestDeadlinePolicy that extends SchedulingPolicy and let queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-998) Persistent resource change during NM/RM restart
[ https://issues.apache.org/jira/browse/YARN-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-998: - Attachment: YARN-998-sample.patch The attached file is a sample implementation to persist ResourceOption on the RM. Please refer to it if you are interested. I'm okay with leaving it until the proper time. Thanks. - This patch needs YARN-1911.patch to avoid an NPE - Sorry, this patch supports RM restart only - It may affect scalability on large clusters because it uses XML to persist ResourceOption Persistent resource change during NM/RM restart --- Key: YARN-998 URL: https://issues.apache.org/jira/browse/YARN-998 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-998-sample.patch When the NM is restarted, whether planned or due to a failure, the previous dynamic resource setting should be kept for consistency. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996126#comment-13996126 ] Min Zhou commented on YARN-2048: [~zjshen] Currently, the only implementation of ApplicationContext is ApplicationHistoryManagerImpl, which retrieves container information from the history store. Questions: # How do you fetch the container info from a history server and display it on the RM web? # If the information comes from the history store, it seems the RM won't get that kind of info until the application is done? Sometimes a user's application might be a long-lived application that never finishes unless the user kills it. # It seems the only way to provide container info to the RM is to maintain a list in RMAppAttemptImpl, which was my approach as well. Min List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996062#comment-13996062 ] Wangda Tan commented on YARN-2048: -- Hi [~zjshen], I think these two JIRAs cover similar issues. I took a quick look at your patch and found you haven't changed RMAppAttempt/RMApp/RMContainer, so could you please elaborate a little on what you did to get the containers of an application? I ask because I'm thinking [~coderplay]'s patch can be considered complementary to your solution if you haven't implemented it. List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995836#comment-13995836 ] Jian He commented on YARN-2017: --- The findbugs warning should not be a problem, will suppress it. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995863#comment-13995863 ] Min Zhou commented on YARN-2048: Thanks [~leftnoteasy], I will take a look at the items you mentioned and resubmit another patch. List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1976) Tracking url missing http protocol for FAILED application
[ https://issues.apache.org/jira/browse/YARN-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995930#comment-13995930 ] Junping Du commented on YARN-1976: -- Thanks [~jianhe] for review and commit! Tracking url missing http protocol for FAILED application - Key: YARN-1976 URL: https://issues.apache.org/jira/browse/YARN-1976 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-1976-v2.patch, YARN-1976.patch Run yarn application -list -appStates FAILED, It does not print http protocol name like FINISHED apps. {noformat} -bash-4.1$ yarn application -list -appStates FINISHED,FAILED,KILLED 14/04/15 23:55:07 INFO client.RMProxy: Connecting to ResourceManager at host Total number of applications (application-types: [] and states: [FINISHED, FAILED, KILLED]):4 Application-IdApplication-Name Application-Type User Queue State Final-State ProgressTracking-URL application_1397598467870_0004 Sleep job MAPREDUCEhrt_qa defaultFINISHED SUCCEEDED 100% http://host:19888/jobhistory/job/job_1397598467870_0004 application_1397598467870_0003 Sleep job MAPREDUCEhrt_qa defaultFINISHED SUCCEEDED 100% http://host:19888/jobhistory/job/job_1397598467870_0003 application_1397598467870_0002 Sleep job MAPREDUCEhrt_qa default FAILED FAILED 100% host:8088/cluster/app/application_1397598467870_0002 application_1397598467870_0001 word count MAPREDUCEhrt_qa defaultFINISHED SUCCEEDED 100% http://host:19888/jobhistory/job/job_1397598467870_0001 {noformat} It only prints 'host:8088/cluster/app/application_1397598467870_0002' instead 'http://host:8088/cluster/app/application_1397598467870_0002' -- This message was sent by Atlassian JIRA (v6.2#6252)
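The description boils down to the FAILED app's tracking URL being stored without a scheme. The helper below is a hypothetical illustration of the kind of normalization that would make the FAILED row print like the FINISHED ones; it is not the actual YARN-1976 change.
{code}
public final class TrackingUrlSketch {
  // Hypothetical helper: prepend a scheme only when the stored URL lacks one.
  static String normalizeTrackingUrl(String url) {
    if (url == null || url.isEmpty()
        || url.startsWith("http://") || url.startsWith("https://")) {
      return url;
    }
    return "http://" + url;
  }

  public static void main(String[] args) {
    // Prints http://host:8088/cluster/app/application_1397598467870_0002
    System.out.println(
        normalizeTrackingUrl("host:8088/cluster/app/application_1397598467870_0002"));
  }
}
{code}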
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993895#comment-13993895 ] Karthik Kambatla commented on YARN-556: --- For the scheduler-related work itself, the consensus from the offline sync-up was that it would be best to move as much common code as possible to AbstractYarnScheduler. To unblock the restart work as early as possible, we should do it in two phases: a first phase that only pulls out the pieces that make it easier to handle recovery, and a more comprehensive re-jig later. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993864#comment-13993864 ] Vinod Kumar Vavilapalli commented on YARN-2032: --- Sure, as I can see, you already took it over while the mailing list is down :) Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1922) Process group remains alive after container process is killed externally
[ https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993821#comment-13993821 ] Hadoop QA commented on YARN-1922: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644136/YARN-1922.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3727//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3727//console This message is automatically generated. Process group remains alive after container process is killed externally Key: YARN-1922 URL: https://issues.apache.org/jira/browse/YARN-1922 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Environment: CentOS 6.4 Reporter: Billie Rinaldi Assignee: Billie Rinaldi Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch If the main container process is killed externally, ContainerLaunch does not kill the rest of the process group. Before sending the event that results in the ContainerLaunch.containerCleanup method being called, ContainerLaunch sets the completed flag to true. Then when cleaning up, it doesn't try to read the pid file if the completed flag is true. If it read the pid file, it would proceed to send the container a kill signal. In the case of the DefaultContainerExecutor, this would kill the process group. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996355#comment-13996355 ] Tsuyoshi OZAWA commented on YARN-2052: -- We can have 2 options for now: 1. Changing container Id format. ContainerId should be an opaque string that YARN app developers don't take a dependency on. 2. Preserving container Id format. RM restart Phase 2 should be transparent from YARN users. Container ID format and clustertimestamp for Work preserving restart Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA We've been discussing whether container id format is changed to include cluster timestamp or not on YARN-556 and YARN-2001. This JIRA is for taking the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996391#comment-13996391 ] Junping Du commented on YARN-1362: -- Thanks for the patch, [~jlowe]! I just started working on rolling upgrade, so I need to understand the work-preserving behavior in NM restart. One question here: do we expect every NM shutdown operation to hint that the NM will be restarted soon? If not, the work will still be preserved, since isDecommissioned is set to false by default. Is that the behavior we expect? Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhou updated YARN-2048: --- Description: Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon was: Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which node those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Reporter: Min Zhou Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-182) Unnecessary Container killed by the ApplicationMaster message for successful containers
[ https://issues.apache.org/jira/browse/YARN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996398#comment-13996398 ] Jason Lowe commented on YARN-182: - bq. In my case the reducers were moved to COMPLETED state after 22 mins, they had reached 100% progress at 15 mins. Having progress reach 100% but the task not completing for 7 more minutes is an unrelated issue. Check your reducer logs and/or the input format which is responsible for setting the progress. This is probably a question better suited for the u...@hadoop.apache.org mailing list. Unnecessary Container killed by the ApplicationMaster message for successful containers - Key: YARN-182 URL: https://issues.apache.org/jira/browse/YARN-182 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.1-alpha Reporter: zhengqiu cai Assignee: Omkar Vinit Joshi Labels: hadoop, usability Attachments: Log.txt I was running wordcount and the resourcemanager web UI shown the status as FINISHED SUCCEEDED, but the log shown Container killed by the ApplicationMaster -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996395#comment-13996395 ] Jason Lowe commented on YARN-1362: -- Yes, that's the intended behavior. If ops is shutting down the NM and not expecting it to return anytime soon then it should be decommissioned from the RM. Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996400#comment-13996400 ] Tsuyoshi OZAWA commented on YARN-2052: -- One discussion point is how YARN apps and cluster management systems(e.g. Apache Ambari) depend on the container id format currently. For example, MRv2 uses utility methods like ConverterUtils.toContainerId(containerIdStr) provided in org.apache.hadoop.yarn.util. Container ID format and clustertimestamp for Work preserving restart Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA We've been discussing whether container id format is changed to include cluster timestamp or not on YARN-556 and YARN-2001. This JIRA is for taking the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-570: --- Assignee: Akira AJISAKA (was: PengZhang) Assigned to myself. [~peng.zhang], feel free to reassign it to yourself if you want to work on it. Time strings are formatted in different timezones --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Reporter: PengZhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT. If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.2#6252)
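To make the mismatch concrete, the snippet below formats the timestamp from the description once pinned to GMT and once in the JVM's default zone. It is only a demonstration of the inconsistency, not the fix (which would be picking a single zone in both render paths).
{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeZoneSketch {
  public static void main(String[] args) {
    // Epoch millis for 10 Apr 2013 08:29:56 GMT, the value from the description.
    long ts = 1365582596000L;
    SimpleDateFormat gmt = new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss");
    gmt.setTimeZone(TimeZone.getTimeZone("GMT"));
    SimpleDateFormat serverLocal = new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss");
    // Same instant, two renderings; on a GMT+8 server the second line
    // shows 16:29:56, matching the report.
    System.out.println("GMT:   " + gmt.format(new Date(ts)));
    System.out.println("Local: " + serverLocal.format(new Date(ts)));
  }
}
{code}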
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996343#comment-13996343 ] Tsuyoshi OZAWA commented on YARN-556: - Good point, Bikas. Created YARN-2052 for tracking container id discussion. [~adhoot], let's discuss there. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996502#comment-13996502 ] Junping Du commented on YARN-1362: -- That makes sense. The patch LGTM. Kicking off Jenkins again, as the patch has been around for a while (but still applies). Will commit it after Jenkins gives a +1. Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996587#comment-13996587 ] Jason Lowe commented on YARN-1751: -- +1, committing this. Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1751-trunk.patch, YARN-1751.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1302) Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996621#comment-13996621 ] Zhijie Shen commented on YARN-1302: --- It looks like we don't need to implement separate DT stack for a single daemon. Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol -- Key: YARN-1302 URL: https://issues.apache.org/jira/browse/YARN-1302 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Like the ApplicationClientProtocol, ApplicationHistoryProtocol needs its own security stack. We need to implement AHSDelegationTokenSecretManager, AHSDelegationTokenIndentifier, AHSDelegationTokenSelector and other analogs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996675#comment-13996675 ] Bikas Saha commented on YARN-2052: -- The RM identifier is effectively the epoch for the RM. We already use it in the NM to differentiate between allocations made by old RM vs the new RM. Using the appId in the container id prevents us from using this epoch number since the appId cannot change across restarts for containers belonging to the same app. That will be backwards incompatible. Another alternative would be to replace the monotonically increasing sequence number with a unique identifier like a UUID. But that is also incompatible. Another alternative is to create another epoch number for the RM in addition to the cluster timestamp. The monotonically increasing sequence could be a combination (concatenation) of the new epoch number and the sequence number. e.g. container_XXX_1000 after epoch 1. When the epoch number is 0 then we can drop the epoch number and things look the same as today. e.g. container_XXX_000. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
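A small sketch of the concatenation idea described in the comment above: the RM epoch is prefixed to the per-app container sequence number and dropped when the epoch is 0, so existing ids keep their current shape. The padding width and helper name are illustrative only; this is not necessarily the format YARN ultimately adopted.
{code}
public final class EpochContainerIdSketch {
  // Combine the RM epoch with the per-app sequence number, as in the examples
  // container_XXX_000 (epoch 0) and container_XXX_1000 (epoch 1).
  static String containerIdSuffix(long epoch, int sequence) {
    String seq = String.format("%03d", sequence); // 3-digit padding as in the examples
    return epoch == 0 ? seq : Long.toString(epoch) + seq;
  }

  public static void main(String[] args) {
    System.out.println("container_XXX_" + containerIdSuffix(0, 0)); // container_XXX_000
    System.out.println("container_XXX_" + containerIdSuffix(1, 0)); // container_XXX_1000
  }
}
{code}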
[jira] [Updated] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2052: - Summary: Container ID format and clustertimestamp for Work preserving restart (was: ClusterId format and clustertimestamp) Container ID format and clustertimestamp for Work preserving restart Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA We've been discussing whether container id format is changed to include cluster timestamp or not on YARN-556 and YARN-2001. This JIRA is for taking the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996752#comment-13996752 ] Min Zhou commented on YARN-2048: Make sense now, please go ahead. List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, Yarn haven't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is maintain a container list in RMAppImpl and expose this info to Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.3.patch Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996492#comment-13996492 ] Tsuyoshi OZAWA commented on YARN-2017: -- Thanks for your patch, Jian! Some comments: 1. +1 for using Generics to avoid warnings from casting, as Sandy and Wangda mentioned. 2. Can we assert node with Preconditions.checkNotNull() in SchedulerNode? {code}
  public SchedulerNode(RMNode node, boolean usePortForNodeName) {
    this.rmNode = node;
    this.availableResource = Resources.clone(node.getTotalCapability());
    this.totalResourceCapability = Resources.clone(node.getTotalCapability());
    ...
  }
{code} 3. Some lines are over 80 chars. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
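A sketch of what point 2 could look like applied to the constructor quoted above, assuming Guava's Preconditions is available on the classpath (Hadoop already uses it elsewhere). The fragment mirrors the quoted snippet rather than being a complete class.
{code}
  // Assumes com.google.common.base.Preconditions (Guava) is imported.
  public SchedulerNode(RMNode node, boolean usePortForNodeName) {
    // Fail fast on a null RMNode instead of hitting an NPE on getTotalCapability().
    Preconditions.checkNotNull(node, "RMNode cannot be null");
    this.rmNode = node;
    this.availableResource = Resources.clone(node.getTotalCapability());
    this.totalResourceCapability = Resources.clone(node.getTotalCapability());
    ...
  }
{code}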
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995700#comment-13995700 ] Xuan Gong commented on YARN-1861: - bq. I tried to just apply the test-case and run it without the core change and was expecting the active RM to go to standby and the standby RM to go to active once the originally active RM is fenced. Instead I get a NPE somewhere. Can the test be fixed to do so? In the test case, I manually send the RMFatalEvent with RMFatalEventType.STATE_STORE_FENCED to the current active RM (rm1). This active RM handles the event and transitions to standby. Both RMs are then in the standby state, while ZK still thinks rm1 is in the active state, so it will not trigger leader election. I think this mimics the behavior we described previously. Without the core code change, this test case will fail: the NM is trying to connect to the active RM, but neither of the two RMs is active, so the NPE is expected. bq. Also, we need to make sure that when automatic failover is enabled, all external interventions like a fence like this bug (and forced-manual failover from CLI?) do a similar reset into the leader election. There may not be cases like this today though.. For external interventions with automatic failover right now, we have transitionToActive/transitionToStandby plus forcemanual from the CLI. The current behaviors are: if we do transitionToActive + forcemanual + the current standby RM id, the standby RM will transition to active. In the meantime, it will do the fence, and the current active RM will transition to standby. If there are any exceptions, the RM will either be terminated or go back to the standby state, which resets the leader election. In both cases, ZK will trigger a new round of leader election. If we do transitionToStandby + forcemanual + the current active RM id, both RMs end up in the standby state, and another transitionToActive command is needed. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996776#comment-13996776 ] Jian He commented on YARN-2017: --- bq. Since this is original logic of SchedulerNode, I think it's better to throw exception instead of print a irresponsible log here. Null object passed in should be considered big problem in scheduler. And several following places in this class. On a second thought, user might pass in a resource request with null capability. I would prefer not changing it. In fact, we can add many other null checks in many places. Changed the patch back. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996749#comment-13996749 ] Vinod Kumar Vavilapalli commented on YARN-1861: --- Okay, that's much better. +1. Will check this in once Jenkins says okay.. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996354#comment-13996354 ] Junping Du commented on YARN-667: - Thanks for your comments, [~zjshen]! bq. Shall we consider the version of history data as well? That's a good point. We should consider history data too: ApplicationHistoryData, ApplicationAttemptHistoryData, ContainerHistoryData, etc. could change in the future and should be properly versioned. In addition, any change to the format (i.e. APPLICATION_PREFIX, etc.) or the path of the history file (root+appID for now) could be very challenging for rolling upgrade (and I think would also be seen as an incompatible change). [~zjshen], do you have a sense of how likely this is to happen in 2.x, since you are currently working on ATS? Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2014) Performance: AM scalability is 10% slower in 2.4 compared to 0.23.9
[ https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2014: -- Assignee: Jason Lowe Performance: AM scalability is 10% slower in 2.4 compared to 0.23.9 Key: YARN-2014 URL: https://issues.apache.org/jira/browse/YARN-2014 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: patrick white Assignee: Jason Lowe Performance comparison benchmarks from 2.x against 0.23 show the AM scalability benchmark's runtime is approximately 10% slower in 2.4.0. The trend is consistent across later releases in both lines; the latest release numbers are: 2.4.0.0 runtime 255.6 seconds (avg 5 passes) 0.23.9.12 runtime 230.4 seconds (avg 5 passes) Diff: -9.9% The AM Scalability test is essentially a sleep job that measures the time to launch and complete a large number of mappers. The diff is consistent and has been reproduced in both a larger (350 node, 100,000 mappers) perf environment and a small (10 node, 2,900 mappers) demo cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-45) [Preemption] Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-45: Summary: [Preemption] Scheduler feedback to AM to release containers (was: Scheduler feedback to AM to release containers) [Preemption] Scheduler feedback to AM to release containers --- Key: YARN-45 URL: https://issues.apache.org/jira/browse/YARN-45 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Chris Douglas Assignee: Carlo Curino Fix For: 2.1.0-beta Attachments: YARN-45.1.patch, YARN-45.patch, YARN-45.patch, YARN-45.patch, YARN-45.patch, YARN-45.patch, YARN-45.patch, YARN-45_design_thoughts.pdf The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed- or reserved- to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers. [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1184) ClassCastException is thrown during preemption when a huge job is submitted to a queue B whose resources are used by a job in queue A
[ https://issues.apache.org/jira/browse/YARN-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1184: -- Issue Type: Sub-task (was: Bug) Parent: YARN-45 ClassCastException is thrown during preemption when a huge job is submitted to a queue B whose resources are used by a job in queue A --- Key: YARN-1184 URL: https://issues.apache.org/jira/browse/YARN-1184 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Affects Versions: 2.1.0-beta Reporter: J.Andreina Assignee: Chris Douglas Fix For: 2.1.1-beta Attachments: Y1184-0.patch, Y1184-1.patch Preemption is enabled. Queue = a,b; a capacity = 30%; b capacity = 70%. Step 1: Assign a big job to queue a (so that job_a will utilize some resources from queue b). Step 2: Assign a big job to queue b. The following exception is thrown at the ResourceManager:
{noformat}
2013-09-12 10:42:32,535 ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception.
java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be cast to java.util.NavigableSet
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
        at java.lang.Thread.run(Thread.java:662)
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
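The cast failure in the stack trace is a generic Java pitfall rather than anything YARN-specific, illustrated by this standalone sketch (names are illustrative, not taken from the YARN code): Collections.unmodifiableSet wraps the set in a plain Set view, so casting that view back to NavigableSet fails at runtime.
{code}
import java.util.Collections;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

public class UnmodifiableCastDemo {
  public static void main(String[] args) {
    NavigableSet<String> containers = new TreeSet<String>();
    containers.add("container_1380289782418_0003_01_000001");

    // The unmodifiable wrapper implements Set but not NavigableSet.
    Set<String> readOnly = Collections.unmodifiableSet(containers);

    // Throws java.lang.ClassCastException:
    // java.util.Collections$UnmodifiableSet cannot be cast to java.util.NavigableSet
    NavigableSet<String> broken = (NavigableSet<String>) readOnly;
    System.out.println(broken);
  }
}
{code}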
[jira] [Commented] (YARN-2014) Performance: AM scalability is 10% slower in 2.4 compared to 0.23.9
[ https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996737#comment-13996737 ] Vinod Kumar Vavilapalli commented on YARN-2014: --- Thanks for the info, Jason. Do you have a link to the JIRA covering the FS ServiceLoader stuff? In your configs, which file-systems have their impls defined? Or is it just the default impls added by the default config files? Maybe one thing that can be done, if possible and if you have time, is to remove the unnecessary service-loader declaration files (not sure what they are called) from the installation and try this again. Performance: AM scalability is 10% slower in 2.4 compared to 0.23.9 Key: YARN-2014 URL: https://issues.apache.org/jira/browse/YARN-2014 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: patrick white Assignee: Jason Lowe Performance comparison benchmarks from 2.x against 0.23 show the AM scalability benchmark's runtime is approximately 10% slower in 2.4.0. The trend is consistent across later releases in both lines; the latest release numbers are: 2.4.0.0 runtime 255.6 seconds (avg 5 passes) 0.23.9.12 runtime 230.4 seconds (avg 5 passes) Diff: -9.9% The AM Scalability test is essentially a sleep job that measures the time to launch and complete a large number of mappers. The diff is consistent and has been reproduced in both a larger (350 node, 100,000 mappers) perf environment and a small (10 node, 2,900 mappers) demo cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-1366: - Attachment: YARN-1366.1.patch I updated the patch with the following changes in AMRMClient (MapReduce is not considered here): 1. On resync from the RM, reset lastResponseId and re-register with the RM. 2. Add back the ResourceRequests from the last allocate request. 3. After 1 and 2, AMRMClient continues heartbeating (see the sketch below). The patch does not contain tests; I will write tests in the next patches. Please review the initial patch and whether it satisfies the task expectations. Work items to be decided: 1. On resync, the last ResourceRequests are added back to ask and sent again on the next heartbeat. My doubt is, what about older asks that were sent in earlier heartbeats but not yet allocated? Earlier requests can be repopulated using remoteRequestTable. 2. Should the MapReduce changes be handled in this jira? The current behaviour of AMs treats RESYNC and SHUTDOWN as the same. It would be very useful if the resync and shutdown commands were issued separately by ApplicationMasterService. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
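To make points 1-3 above concrete, a minimal sketch of the resync branch as it might look on the client side, assuming the AMCommand.AM_RESYNC signal carried in the AllocateResponse; reRegisterWithRM() and addOutstandingRequestsBackToAsk() are hypothetical helpers standing in for the real AMRMClientImpl internals, not the actual patch:
{code}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.AMCommand;

// Hypothetical sketch only; helper methods are placeholders.
private int lastResponseId;

private void handleAllocateResponse(AllocateResponse response) throws Exception {
  if (response.getAMCommand() == AMCommand.AM_RESYNC) {
    lastResponseId = 0;                  // 1. reset the allocate RPC sequence number
    reRegisterWithRM();                  // 1. re-register with the restarted RM
    addOutstandingRequestsBackToAsk();   // 2. replay the last outstanding ResourceRequests
    return;                              // 3. keep heartbeating as usual on the next cycle
  }
  // normal handling of newly allocated and completed containers goes here
}
{code}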
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997024#comment-13997024 ] Alejandro Abdelnur commented on YARN-1368: -- [~jianhe], [~vinodkv], Unless I'm missing something, Anubhav was working on this JIRA. It is great that Jian did the refactoring to have common code for the schedulers and some testcases for it, but most of the work has been done by Anubhav and he was working actively on it. We should reassign the JIRA back to Anubhav and let him drive it to completion, agree? Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1957) ProportionalCapacityPreemptionPolicy handling of corner cases...
[ https://issues.apache.org/jira/browse/YARN-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated YARN-1957: Fix Version/s: 3.0.0 2.5.0 ProportionalCapacityPreemptionPolicy handling of corner cases... --- Key: YARN-1957 URL: https://issues.apache.org/jira/browse/YARN-1957 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler, preemption Fix For: 3.0.0, 2.5.0, 2.4.1 Attachments: YARN-1957.patch, YARN-1957.patch, YARN-1957_test.patch The current version of ProportionalCapacityPreemptionPolicy should be improved to deal with the following two scenarios: 1) when rebalancing over-capacity allocations, it potentially preempts without considering the maxCapacity constraints of a queue (i.e., preempting possibly more than strictly necessary) 2) a zero-capacity queue is preempted even if there is no demand (this should stay consistent with the old use of zero capacity to disable queues) The proposed patch fixes both issues and introduces a few new test cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-2052: - Description: (was: We've been discussing whether container id format is changed to include cluster timestamp or not on YARN-556 and YARN-2001. This JIRA is for taking the discussion. ) Container ID format and clustertimestamp for Work preserving restart Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996756#comment-13996756 ] Jian He commented on YARN-2017: --- Thanks Sandy, Wangda, and Tsuyoshi for the review and comments! bq. Why take out the header comment in SchedulerNode? Accidentally removed; added it back. bq. Can we use generics to avoid all the casting (and find bugs)? Makes sense. Changed AbstractYarnScheduler to use generics. We may need to make SchedulerNode generic over the application type (e.g. SchedulerNode<FSSchedulerApp>) as well to avoid the type-cast warning, but changing that would make the patch much bigger since SchedulerNode has too many references; we can fix it in a separate patch. bq. I'm wondering if it is meaningful to merge this method, too much code changes due to this merge. This will be commonly used by YARN-1368; that's why I merged it here. Fixed the other comments accordingly. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
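A rough sketch of the generics change being described, with illustrative bounds (the exact type parameters and members in the patch may differ): parameterizing the shared base class lets each scheduler bind its own application and node types, so the common code avoids unchecked casts.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.yarn.api.records.NodeId;

// Illustrative only; bounds and members are simplified.
public abstract class AbstractYarnScheduler
    <T extends SchedulerApplicationAttempt, N extends SchedulerNode>
    extends AbstractService {

  protected final Map<NodeId, N> nodes = new ConcurrentHashMap<NodeId, N>();

  protected AbstractYarnScheduler(String name) {
    super(name);
  }

  public N getSchedulerNode(NodeId nodeId) {
    // Callers such as FairScheduler get an FSSchedulerNode back without casting.
    return nodes.get(nodeId);
  }
}

// e.g. public class FairScheduler extends AbstractYarnScheduler<FSSchedulerApp, FSSchedulerNode> { ... }
{code}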
[jira] [Created] (YARN-2052) ClusterId format and clustertimestamp
Tsuyoshi OZAWA created YARN-2052: Summary: ClusterId format and clustertimestamp Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA We've been discussing whether container id format is changed to include cluster timestamp or not on YARN-556 and YARN-2001. This JIRA is for taking the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997089#comment-13997089 ] Zhijie Shen commented on YARN-2048: --- Hi [~coderplay], what is the exact requirement? 2.4 is already out, and YARN-1809 was not able to get in as we didn't have enough time to test it thoroughly. Hence we target 2.5 for that jira, which should cover the scenario here. Usually we don't include major changes in a maintenance release such as 2.4.1. However, if you think it is really urgent and a required fix for 2.4.1, please feel free to reopen it and set the target version to 2.4.1. Thanks! List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, YARN doesn't provide a way to list all of the containers of an application from its web UI. This kind of information is needed by the application user: they can conveniently see how many containers their applications have already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is to maintain a container list in RMAppImpl and expose this info on the Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997071#comment-13997071 ] Jian He commented on YARN-1368: --- [~tucu00], Please understand that the patch uploaded here is completely different from the prototype patch on YARN-556. The patch here is using a different approach to cover all schedulers and also the whole container recovery flow is different which simplifies things a lot. This jira itself was originally opened as “RM should populate running container allocation information from NM resync” and did not cover recovering the schedulers. I should have opened a new jira to express the approaches instead of renaming this one to avoid confusion. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996512#comment-13996512 ] Junping Du commented on YARN-2016: -- [~venkatnrangan], you are right. When a similar bug happens, we often suspect the logic in the client or server but ignore the wire logic. In most cases, we don't even have a simple unit test to verify these PBImpls, which lets bugs stay hidden easily. I already filed YARN-2051 to address this; it would be great if you want to help there. Thanks! Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned. Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.12.patch Thank you for the great review, [~kkambatl]! Updated a patch to address the points. I believe that this latest patch gets much simpler than previous one by your review. This is the reply to your review comments and change logs in this patch: 1. TestRMDelegationTokens causes NPE in AllocationFileLoaderService#stop() if we don't check null. This is regression. Therefore, this patch still includes the null check in AllocationFileLoaderService. Please let me know if we accept this regression. Attached log when TestRMDelegationTokens causes NPE in the tail of this comments. 2. Added spaces between each interface ResourceSchedulerWrapper implements. 3. Updated a comment about ResourceScheduler#setRMContext() 4. Changed to call {{reinitialize()}} from serviceInit()/serviceStart() in all schedulers. 5. FairScheduler: Removed isUpdateThreadRunning/isSchedulingThreadRunning from FairScheduler. 6. FairScheduler: serviceStartInternal()/serviceStopInternal() is removed in 4. 7, 8. Changed to call updateThread/schedulingThread#join in serviceStop(). Additionally, AllocationFileLoaderService#reloadThread has same problem, so I changed to call join in AllocationFileLoaderService#stop() method. {quote} java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.stop(AllocationFileLoaderService.java:149) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceStop(FairScheduler.java:1268) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:506) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:839) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:889) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:944) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens.testRMDTMasterKeyStateOnRollingMasterKey(TestRMDelegationTokens.java:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) 
{quote} Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
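A compact sketch of the service lifecycle pattern the patch moves the schedulers to, assuming the standard AbstractService hooks; the thread body and names below are placeholders, not FairScheduler's real update/scheduling threads:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

public class SketchScheduler extends AbstractService {
  private Thread updateThread;
  private volatile boolean running;

  public SketchScheduler() {
    super(SketchScheduler.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // one-time initialization, the equivalent of calling reinitialize(conf, ...) today
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    running = true;
    updateThread = new Thread(new Runnable() {
      @Override
      public void run() {
        while (running) {
          // periodic update/scheduling work goes here
        }
      }
    }, "SketchScheduler-update");
    updateThread.start();
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    running = false;
    if (updateThread != null) {
      updateThread.interrupt();
      updateThread.join();  // join so the thread cannot outlive the service (points 7 and 8 above)
    }
    super.serviceStop();
  }
}
{code}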
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.2.patch Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997095#comment-13997095 ] Junping Du commented on YARN-1362: -- I have committed this to trunk and branch-2. Thank you, [~jlowe]! Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shut down more permanently (e.g. a decommission) then the nodemanager should clean up directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2054: --- Attachment: yarn-2054-1.patch Straightforward patch that brings the cumulative retry time down to 10 seconds, the same as yarn.resourcemanager.zk-timeout-ms. Poor defaults for YARN ZK configs for retries and retry-interval --- Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2054-1.patch Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds before the RM gives up trying to connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
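For illustration, the worst case with the old defaults is roughly 500 retries x 2000 ms, which is about 1000 seconds. An override along these lines keeps the budget near the ZK session timeout; the values below are an assumption matching the ~10 second target described above, not necessarily the ones chosen in the patch:
{code}
<!-- yarn-site.xml: example override, values are illustrative -->
<property>
  <name>yarn.resourcemanager.zk-num-retries</name>
  <value>5</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-retry-interval-ms</name>
  <value>2000</value>
</property>
<!-- 5 retries x 2000 ms = 10 s, the same order as yarn.resourcemanager.zk-timeout-ms -->
{code}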
[jira] [Commented] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997054#comment-13997054 ] Ashwin Shankar commented on YARN-2012: -- Hi [~sandyr], do you have any comments? Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997006#comment-13997006 ] Tsuyoshi OZAWA commented on YARN-1474: -- Waiting for Jenkins. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-185) Add preemption to CS
[ https://issues.apache.org/jira/browse/YARN-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-185: - Issue Type: Sub-task (was: New Feature) Parent: YARN-45 Add preemption to CS Key: YARN-185 URL: https://issues.apache.org/jira/browse/YARN-185 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Arun C Murthy Umbrella jira to track adding preemption to CS, let's track via sub-tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997159#comment-13997159 ] Karthik Kambatla commented on YARN-2054: On a cluster with RM HA and a buggy RM, this led to a long wait before failover. Poor defaults for YARN ZK configs for retries and retry-interval --- Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds before the RM gives up trying to connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1337) Recover containers upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1337: - Description: To support work-preserving NM restart we need to recover the state of the containers when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers along with how to reacquire the active containers and determine their exit codes when they terminate. The state of finished containers also needs to be recovered. (was: To support work-preserving NM restart we need to recover the state of the containers that were active when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers along with how to reacquire the active containers and determine their exit codes when they terminate.) Summary: Recover containers upon nodemanager restart (was: Recover active container state upon nodemanager restart) Updating headline and description to note that this task also includes recovering the state of finished containers as well. Recover containers upon nodemanager restart --- Key: YARN-1337 URL: https://issues.apache.org/jira/browse/YARN-1337 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe To support work-preserving NM restart we need to recover the state of the containers when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers along with how to reacquire the active containers and determine their exit codes when they terminate. The state of finished containers also needs to be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1368: Attachment: YARN-1368.combined.001.patch Thanks [~jianhe] for making the scheduler changes generic. I have added back the FairScheduler changes accordingly, and I have refactored your unit test so we can test both Capacity and Fair. The rest of the patch looks similar to my [YARN-556|https://issues.apache.org/jira/browse/YARN-556] prototype patch, so we are pretty much in sync there. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1550) the page http:/ip:50030/cluster/scheduler has 500 error in fairScheduler
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1550: --- Description: three Steps : 1、debug at RMAppManager#submitApplication after code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = Application with id + applicationId + is already present! Cannot add a duplicate!; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2、submit one application:hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3、go in page :http://ip:50030/cluster/scheduler and find 500 ERROR! the log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} was: three Steps : 1、debug at RMAppManager#submitApplication after code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = Application with id + applicationId + is already present! Cannot add a duplicate!; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2、submit one application:hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3、go in page :http://ip:50030/cluster/scheduler and find 500 ERROR! 
the log: 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at
[jira] [Updated] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-2052: - Summary: ContainerId creation after work preserving restart is broken (was: Container ID format and clustertimestamp for Work preserving restart) ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1803) Signal container support in nodemanager
[ https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997062#comment-13997062 ] Ming Ma commented on YARN-1803: --- Vinod, thanks for the great feedback. To summarize: 1. Add a signalContainers method to both ApplicationClientProtocol and ContainerManagementProtocol to support an ordered list. 2. stopContainers will be deprecated eventually. 3. MR needs to be changed to call signalContainers instead of stopContainers. For the SignalContainerCommand, I will update that in YARN-1897. We still need to define signalContainerRequest in addition to signalContainersRequest. Signal container support in nodemanager --- Key: YARN-1803 URL: https://issues.apache.org/jira/browse/YARN-1803 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1803.patch It could include the followings. 1. ContainerManager is able to process a new event type ContainerManagerEventType.SIGNAL_CONTAINERS coming from NodeStatusUpdater and deliver the request to ContainerExecutor. 2. Translate the platform independent signal command to Linux specific signals. Windows support will be tracked by another task. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1803) Signal container support in nodemanager
[ https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997252#comment-13997252 ] Vinod Kumar Vavilapalli commented on YARN-1803: --- bq. 1. Add signalContainers method to [..] ContainerManagementProtocol to support ordered list. Yup. We can do the above as a follow up though. It seems like most cases are centered primarily around the RM API. Signal container support in nodemanager --- Key: YARN-1803 URL: https://issues.apache.org/jira/browse/YARN-1803 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1803.patch It could include the followings. 1. ContainerManager is able to process a new event type ContainerManagerEventType.SIGNAL_CONTAINERS coming from NodeStatusUpdater and deliver the request to ContainerExecutor. 2. Translate the platform independent signal command to Linux specific signals. Windows support will be tracked by another task. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhou reopened YARN-2048: I'd like to reopen it because I think we need this feature for 2.[0-4].x users before the timeline server work is finished. What would you like to do, [~zjshen]? List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, YARN doesn't provide a way to list all of the containers of an application from its web UI. This kind of information is needed by the application user: they can conveniently see how many containers their applications have already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is to maintain a container list in RMAppImpl and expose this info on the Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2051) Add more unit tests for PBImpl that didn't get covered
Junping Du created YARN-2051: Summary: Add more unit tests for PBImpl that didn't get covered Key: YARN-2051 URL: https://issues.apache.org/jira/browse/YARN-2051 Project: Hadoop YARN Issue Type: Test Reporter: Junping Du Priority: Critical From YARN-2016, we can see that bugs can exist in the PB implementations of the protocol records. The bad news is that most of these PBImpls don't have any unit tests to verify that the info is not lost or changed after serialization/deserialization. We should add more tests for them. -- This message was sent by Atlassian JIRA (v6.2#6252)
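As an illustration of what such a test could look like, here is a sketch of a round-trip test for the record from YARN-2016, assuming the usual PBImpl conventions (a proto-taking constructor plus getProto()); the exact class and accessor names are from memory and should be checked against the code:
{code}
import static org.junit.Assert.assertEquals;

import java.util.Collections;
import java.util.Set;

import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;
import org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl;
import org.apache.hadoop.yarn.proto.YarnServiceProtos.GetApplicationsRequestProto;
import org.junit.Test;

public class TestGetApplicationsRequestPBImpl {

  @Test
  public void testApplicationTagsSurviveRoundTrip() {
    Set<String> tags = Collections.singleton("mytag");

    GetApplicationsRequestPBImpl original = new GetApplicationsRequestPBImpl();
    original.setApplicationTags(tags);

    // Serialize to the wire form and rebuild a fresh record from it,
    // mimicking what happens across the RPC boundary.
    GetApplicationsRequestProto proto = original.getProto();
    GetApplicationsRequest roundTripped = new GetApplicationsRequestPBImpl(proto);

    assertEquals(tags, roundTripped.getApplicationTags());
  }
}
{code}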