[jira] [Commented] (YARN-1197) Add container merge support in YARN
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766343#comment-13766343 ] Wangda Tan commented on YARN-1197: -- I don't know whether it is possible to add this on the RM or NM side. I think it would make it easier to move some existing applications (OpenMPI, PBS, etc.) to the YARN platform: such applications already have their own daemons in their original implementations, and container merge would help them reuse that logic with fewer modifications when becoming residents of YARN :) Your suggestions and comments are welcome! -- Thanks, Wangda

Add container merge support in YARN --- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan

Currently, YARN cannot merge several containers on one node into a bigger container. Such a merge would let us ask for resources incrementally, combine them into a bigger allocation, and launch our processes in it. The user scenario: some applications (like OpenMPI) have one daemon per node in their original implementation, and the user's processes are launched directly by the local daemon (like the task-tracker in MRv1, but per-application). Many functionalities depend on the pipes created when a process is forked by its parent, such as IO-forwarding and process monitoring (the daemon does more than the NM does for us), and losing them may cause scalability issues. A very common resource request in the MPI world is: give me 100G of memory in the cluster, and I will launch 100 processes in it. In current YARN, we have the following two choices to make this happen: 1) Send 1G allocation requests iteratively until we have 100G in total, then ask the NMs to launch the 100 MPI processes. That causes the problems mentioned above: no IO-forwarding, no process monitoring, etc.
2) Send a larger resource request, like 10G. But then we may hit the following problems: 2.1 Such a large request is hard to satisfy in one shot. 2.2 We cannot use more resource on a node than the amount we asked for (we can only launch one daemon per node). 2.3 It is hard to decide how much resource to ask for.

So my proposal is: 1) Incrementally send small resource requests as before, until we have enough resource in total. 2) Merge the resources on the same node, so each node has only one big container. 3) Launch a daemon in each node's merged container; the daemon will spawn and manage its local processes.

For example, we need to run 10 processes with 1G each, and we end up with containers 1, 2, 3, 4, 5 on node1, containers 6, 7, 8 on node2, and containers 9, 10 on node3. Then we: merge [1, 2, 3, 4, 5] into container_11 with 5G, launch a daemon, and the daemon launches 5 processes; merge [6, 7, 8] into container_12 with 3G, launch a daemon, and the daemon launches 3 processes; merge [9, 10] into container_13 with 2G, launch a daemon, and the daemon launches 2 processes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
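The per-node bookkeeping in the example above can be sketched in a few lines. This is an illustrative sketch only: the tuple format, the `plan_merges` helper, and the merged-id numbering are assumptions for demonstration, not YARN API.

```python
from collections import defaultdict

def plan_merges(allocated):
    """Group allocated (container_id, node, mem_gb) tuples by node and
    plan one merged container per node, as in the example above.
    The next_id numbering scheme is illustrative, not YARN's."""
    by_node = defaultdict(list)
    for cid, node, mem in allocated:
        by_node[node].append((cid, mem))
    plans = {}
    next_id = 1 + max(cid for cid, _, _ in allocated)
    for node, parts in sorted(by_node.items()):
        plans[node] = {
            "merged_id": next_id,
            "members": [cid for cid, _ in parts],
            "mem_gb": sum(mem for _, mem in parts),
        }
        next_id += 1
    return plans

# Ten 1G containers, spread 5/3/2 across three nodes as in the example.
allocated = [(1, "node1", 1), (2, "node1", 1), (3, "node1", 1),
             (4, "node1", 1), (5, "node1", 1),
             (6, "node2", 1), (7, "node2", 1), (8, "node2", 1),
             (9, "node3", 1), (10, "node3", 1)]
plans = plan_merges(allocated)
# node1 -> container_11 with 5G, node2 -> container_12 with 3G,
# node3 -> container_13 with 2G
```

The daemon launch step would then target one merged container per node.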
[jira] [Created] (YARN-1197) Add container merge support in YARN
Wangda Tan created YARN-1197: Summary: Add container merge support in YARN Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot merge several containers on one node into a bigger container, which would let us ask for resources incrementally, merge them into a bigger one, and launch our processes in it. The user scenario and the merge proposal are described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1197) Add container merge support in YARN
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767242#comment-13767242 ] Wangda Tan commented on YARN-1197: -- Hi Bikas, Thanks for the reply, it helps me understand the YARN mechanism, but I think there are some misunderstandings. In some HPC cases, how many processes will be launched on each node is not determined before we submit the job; we just give the job enough total resource (like 100G) in the cluster. So we have the following problems: 1) We launch exactly one daemon process on each node, and this daemon launches the other local processes. This is the root cause of why we want this feature. 2) We don't know how much resource to request in this case: # Large requests may cause waste, and are hard to get from the RM. # Small requests may not be enough (when the cluster is busy, we cannot backtrack: if we already have a small room on a node, we can only return it and ask for a larger one, but once we return it, the room may be taken by another app and we cannot get it back). With such an API we can implement our AM more easily: we can iteratively send requests to the RM based on what we already have, and finally merge the allocations into big containers and hand them to the real app (like PBS/TORQUE/MPI). We can make a small cluster inside YARN and support HPC workloads very well.
(It's a little similar to Mesos, which aggregates resources under a slave daemon that then manages them, but we don't need to make it dynamic, i.e. increase the container size while it is running; merging before we start the processes is good enough) :)

Add container merge support in YARN --- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot merge several containers on one node into a bigger container, which would let us ask for resources incrementally, merge them into a bigger one, and launch our processes in it. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1197) Support increasing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769256#comment-13769256 ] Wangda Tan commented on YARN-1197: -- {quote} Increasing resources for a container while in acquired state is not different from waiting for some more time on the RM and allocating the larger container in the first attempt, right? {quote} I think there's a little difference. When waiting for a big container in the first attempt, the scheduler puts the request into the reservedContainer of FSSchedulableNode or FiCaSchedulerNode. That is treated as an exception: the RM tries to satisfy such a reserved container first when many different requests exist on the same node at the same time. But if we ask for more resource on an acquired container, which do you prefer: create another exception that puts the acquired container into *SchedulableNode so it is processed with priority, or simply treat the increase as a normal resource request? {quote} Also, the RM starts a timer for each acquired container and expects the container to be launched on the NM before the timer expires. So we dont have too much time for the container to be launched and thus we cannot wait for increasing the resources. {quote} Could we refresh (receivePing) the timer for a container when we successfully increase its resource? {quote} To be useful, we have to be able to increase the resources of a running container. I agree that its a significant change. So making the change will need a more thorough investigation and clear design proposal. {quote} Agree!
I'd like to help move this forward. I need to investigate the end-to-end cases and draft a design proposal; once I have some ideas or questions, I will let you know :) Thanks

Support increasing resources of an allocated container -- Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot merge several containers on one node into a bigger container, which would let us ask for resources incrementally, merge them into a bigger one, and launch our processes in it. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770380#comment-13770380 ] Wangda Tan commented on YARN-1197: -- I totally agree with you. I'll work out a plan covering increasing/decreasing an available container (allocated/running) with RM-AM-NM communication, and will keep you posted. Thanks. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Currently, YARN cannot merge several containers on one node into a bigger container, which would let us ask for resources incrementally, merge them into a bigger one, and launch our processes in it. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1197: - Attachment: yarn-1197.pdf Attached an initial proposal covering increasing/decreasing an acquired or running container; I hope somebody can help review it. Then we can break down the tasks and start working on them. Thanks. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: yarn-1197.pdf Currently, YARN cannot merge several containers on one node into a bigger container, which would let us ask for resources incrementally, merge them into a bigger one, and launch our processes in it. The user scenario is described in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-2297) Preemption can hang in corner case by not allowing any task container to proceed.
[ https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063156#comment-14063156 ] Wangda Tan commented on YARN-2297: -- Hi [~chris.douglas], Thanks for jumping in. bq. Does this occur when the absolute guaranteed capacity of a queue is smaller than the minimum container size? This can happen when (used_capacity_of_a_queue + newly_allocated_container_resource > guaranteed_resource_of_a_queue) && (used_capacity_of_a_queue < guaranteed_resource_of_a_queue). So I propose to change

{code}
while (toBePreempt > 0):
  foreach application:
    foreach container:
      if (toBePreempt > 0):
        do preemption
{code}

to

{code}
while (toBePreempt > 0):
  foreach application:
    foreach container:
      if (toBePreempt > 0) and (container.resource <= toBePreempt * 2):
        do preemption
{code}

to make sure a container is not preempted too aggressively. Does this answer your question? Thanks, Wangda

Preemption can hang in corner case by not allowing any task container to proceed. - Key: YARN-2297 URL: https://issues.apache.org/jira/browse/YARN-2297 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Critical Preemption can cause a hang in a single-node cluster: only the AMs run, and no task container can run. h3. queue configuration Queues A/B have 1% and 99% capacity respectively. No max capacity. h3. scenario Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps and 1 user. Submit app 1 to queue A. Its AM needs 2 GB, and there is 1 task that needs 2 GB, so it occupies the entire cluster. Submit app 2 to queue B. Its AM needs 2 GB, and there are 3 tasks that need 2 GB each. Instead of app 1 being preempted entirely, app 1's AM stays, app 2's AM launches, and no task of either app can proceed. h3.
commands
/usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Dmapreduce.map.memory.mb=2000 -Dyarn.app.mapreduce.am.command-opts=-Xmx1800M -Dmapreduce.randomtextwriter.bytespermap=2147483648 -Dmapreduce.job.queuename=A -Dmapreduce.map.maxattempts=100 -Dmapreduce.am.max-attempts=1 -Dyarn.app.mapreduce.am.resource.mb=2000 -Dmapreduce.map.java.opts=-Xmx1800M -Dmapreduce.randomtextwriter.mapsperhost=1 -Dmapreduce.randomtextwriter.totalbytes=2147483648 dir1
/usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.map.memory.mb=2000 -Dyarn.app.mapreduce.am.command-opts=-Xmx1800M -Dmapreduce.job.queuename=B -Dmapreduce.map.maxattempts=100 -Dmapreduce.am.max-attempts=1 -Dyarn.app.mapreduce.am.resource.mb=2000 -Dmapreduce.map.java.opts=-Xmx1800M -m 1 -r 0 -mt 4000 -rt 0
-- This message was sent by Atlassian JIRA (v6.2#6252)
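The guard proposed in the comment above can be sketched as follows. This is a minimal illustration with the selection loop flattened to a single pass; the function name, list-of-lists input, and abstract resource units are assumptions, not CapacityScheduler code.

```python
def preempt_pass(to_be_preempt, apps):
    """One pass of the proposed loop: a container is only preempted
    while resource is still owed AND it is not more than twice what
    remains to be preempted, so a huge container is not killed to
    reclaim a tiny amount. apps is a list of per-app container sizes
    in abstract units; returns the containers chosen for preemption."""
    chosen = []
    for containers in apps:
        for c in containers:
            if to_be_preempt > 0 and c <= to_be_preempt * 2:
                chosen.append(c)
                to_be_preempt -= c
    return chosen

# Owe 3 units: the 8-unit container is skipped (8 > 3 * 2), and the
# two 2-unit containers are taken instead.
selected = preempt_pass(3, [[8, 2, 2]])
```

Without the second condition, the 8-unit container would be preempted to reclaim only 3 units, which is the over-aggressive behavior the proposal avoids.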
[jira] [Commented] (YARN-2297) Preemption can hang in corner case by not allowing any task container to proceed.
[ https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063249#comment-14063249 ] Wangda Tan commented on YARN-2297: -- Hi [~chris.douglas], Thanks for your reply. I think the dead zone is really a good idea to solve the jitter problem. Wangda Preemption can hang in corner case by not allowing any task container to proceed. - Key: YARN-2297 URL: https://issues.apache.org/jira/browse/YARN-2297 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Critical
[jira] [Commented] (YARN-2297) Preemption can hang in corner case by not allowing any task container to proceed.
[ https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063255#comment-14063255 ] Wangda Tan commented on YARN-2297: -- Hi [~sunilg], Thanks for providing your thoughts! For your 1st point, I think it is better solved as Chris suggested, using the dead zone parameter yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity. For your 2nd point, {quote} I feel now we will take a percentage here to find which queue is under utilized more based on its used vs guaranteed_capacity ? {quote} I think if we use ratio(used, guaranteed), there is a problem: assume qA has 100MB configured and has used 10MB, while qB has 2GB and has used 500MB; can we really say we should allocate resource to qA instead of qB? We have some other options here: 1. Use (guaranteed - used) 2. Use a combined function like sigmoid(ratio(used, guaranteed)) * (guaranteed - used) Do you have any ideas here? Thanks, Wangda Preemption can hang in corner case by not allowing any task container to proceed. - Key: YARN-2297 URL: https://issues.apache.org/jira/browse/YARN-2297 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Critical
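The qA/qB example in the comment above can be checked numerically. In this sketch the function names and the exact sigmoid weighting in the combined metric are illustrative assumptions, not scheduler code.

```python
import math

def usage_ratio(used, guaranteed):
    return used / guaranteed

def headroom(used, guaranteed):
    # Option 1 from the comment: absolute unused guarantee.
    return guaranteed - used

def combined(used, guaranteed):
    # Option 2 from the comment: weight absolute headroom down as the
    # usage ratio grows, via a sigmoid; the exact shape is assumed here.
    penalty = 1.0 / (1.0 + math.exp(-usage_ratio(used, guaranteed)))
    return (1.0 - penalty) * headroom(used, guaranteed)

qa_used, qa_guaranteed = 10, 100      # qA: 10MB used of 100MB guaranteed
qb_used, qb_guaranteed = 500, 2048    # qB: 500MB used of 2GB guaranteed

# By usage ratio alone, qA (0.10) looks more under-served than qB
# (~0.24), yet qB has far more absolute headroom (1548MB vs 90MB);
# both alternative metrics favor qB.
```

This makes the objection concrete: a pure ratio would steer allocation toward qA even though qB is missing far more of its guarantee in absolute terms.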
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063316#comment-14063316 ] Wangda Tan commented on YARN-796: - Hi [~john.jian.fang], Thanks for providing the use cases. bq. Why do users have to choose either decentralized or centralized label configuration? This is because of cases where a user may want to remove some static labels via the dynamic API, but the next time the RM restarts, it will load the static labels again. It would be hard to manage static and dynamic labels together; we would need to handle conflicts, etc. bq. To me, the restful API could be more useful than the Admin UI. I think both are very important in normal cases: the RESTful API can be used by other management frameworks, while the Admin UI can be used directly by admins to tag nodes. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture, etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations for adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063324#comment-14063324 ] Wangda Tan commented on YARN-796: - Hi [~sunilg], Thanks for the reply. bq. 1. In our use case scenarios, we are more likely to have OR and NOT. I feel combination of these labels need to be in a defined or restricted way. Result of some combinations (AND, OR and NOT) may come invalid, and some may need to be reduced. This complexity need not have to bring to RM to take a final decision. Agreed that we need some restricted way; we need to think harder about this :) bq. 2. Reservation: If a node label has many nodes under it, then there is a chance of reservation. Valid candidates may come later, so solution can be look in to this aspect also. Node-label-level reservations? I hadn't thought about this before; I'll think about it, thanks for the reminder. bq. 3. Centralized Configuration: If a new node is added to cluster, may be it can be started by having a label configuration in its yarn-site.xml. This may be fine I feel. your thoughts? What you describe sounds more like decentralized configuration to me. For centralized configuration, I think there could be a node label repo which stores the mapping from nodes to labels, and we would provide a RESTful API for changing it. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch
-- This message was sent by Atlassian JIRA (v6.2#6252)
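The centralized "node label repo" floated in the comment above can be sketched as a plain mapping that a RESTful admin API could front. This is a hypothetical sketch: `NodeLabelStore` and its methods are invented names for illustration, not an actual YARN class.

```python
class NodeLabelStore:
    """Hypothetical centralized node-label repo: a mapping from node
    to a set of labels, mutated only through admin operations (which a
    RESTful API would expose). Illustrative, not YARN code."""

    def __init__(self):
        self._labels = {}  # node hostname -> set of labels

    def add_labels(self, node, labels):
        self._labels.setdefault(node, set()).update(labels)

    def remove_labels(self, node, labels):
        # Removing labels never resurrects them on RM restart, since
        # this store is the single source of truth.
        self._labels.get(node, set()).difference_update(labels)

    def labels_of(self, node):
        return set(self._labels.get(node, ()))

store = NodeLabelStore()
store.add_labels("host1", {"GPU", "LINUX"})
store.remove_labels("host1", {"GPU"})
# host1 now carries only the LINUX label
```

A single store like this avoids the static/dynamic conflict described earlier: there is no yarn-site.xml copy of the labels to be reloaded on restart.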
[jira] [Commented] (YARN-2257) Add user to queue mappings to automatically place users' apps into specific queues
[ https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063721#comment-14063721 ] Wangda Tan commented on YARN-2257: -- Hi [~sandyr], Thanks for pointing me to this. I have a question: what is the expected behavior when an admin wants to add a new QueuePlacementRule? I guess a new class needs to be added to the Hadoop project and Hadoop needs to be rebuilt, right? That seems a little overkill; users may want convenience rather than flexibility. If you think the rules I mentioned are not flexible enough, maybe we can extend them to rules with patterns, like %user-root.users.%user, which means putting applications from %user into root.users.%user. That would be easier for admins than adding a new QueuePlacementRule. I agree it's a good fit for YARN in general, but we should make it easier to use. Please feel free to let me know your comments, thanks. Wangda Add user to queue mappings to automatically place users' apps into specific queues -- Key: YARN-2257 URL: https://issues.apache.org/jira/browse/YARN-2257 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Patrick Liu Assignee: Vinod Kumar Vavilapalli Labels: features Currently, the fair-scheduler supports two modes: a default queue, or an individual queue for each user. Apparently, the default queue is not a good option, because the resources cannot be managed per user or group. However, an individual queue for each user is not good enough either, especially when connecting YARN with Hive: there will be ever more Hive users in a corporate environment, and if we create a queue per user, resource management becomes hard to maintain. I think the problem can be solved like this: 1. Define the user-queue mapping in Fair-Scheduler.xml. Inside each queue, use aclSubmitApps to control a user's ability. 2.
Each time a user submits an app to YARN, if the user is mapped to a queue, the app will be scheduled to that queue; otherwise, the app will be submitted to the default queue. 3. If the user cannot pass the aclSubmitApps limits, the app will not be accepted. -- This message was sent by Atlassian JIRA (v6.2#6252)
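The %user pattern suggested in the comment can be sketched as a tiny resolver. The ordered rule-list format and the `place_app` helper are illustrative assumptions, not the fair-scheduler's actual configuration syntax.

```python
def place_app(user, rules, default="root.default"):
    """Resolve the submission queue for `user`. `rules` is an ordered
    list of (match, target) pairs; '%user' in either side is replaced
    by the submitting user's name, per the %user-root.users.%user
    pattern suggested above. First matching rule wins."""
    for match, target in rules:
        if match.replace("%user", user) == user:
            return target.replace("%user", user)
    return default

rules = [("patrick", "root.hive"),        # explicit user mapping
         ("%user", "root.users.%user")]   # catch-all per-user pattern

# place_app("patrick", rules) -> "root.hive"
# place_app("alice", rules)   -> "root.users.alice"
```

The point of the pattern is that admins get per-user placement from one config line, without writing and compiling a new QueuePlacementRule class.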
[jira] [Commented] (YARN-2285) Preemption can cause capacity scheduler to show 5,000% queue capacity.
[ https://issues.apache.org/jira/browse/YARN-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063725#comment-14063725 ] Wangda Tan commented on YARN-2285: -- Thanks for the comments, Vinod and Sunil. bq. From the look of it, it sounds like this isn't tied to preemption. It looks like this was a bug that exists even when preemption is not enabled. Can we validate that? I'll validate this tomorrow. The root queue usage above 100% is caused by reserved containers: the UI currently shows allocated+reserved for a queue, and we may need to change that so users can more easily understand what happened. Preemption can cause capacity scheduler to show 5,000% queue capacity. -- Key: YARN-2285 URL: https://issues.apache.org/jira/browse/YARN-2285 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Environment: Turn on CS Preemption. Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Minor Attachments: preemption_5000_percent.png I configure queues A and B to have 1% and 99% capacity respectively, with no max capacity on either queue and a high user limit factor. Submit app 1 to queue A: its AM container takes 50% of cluster memory, and task containers take the other 50%. Submit app 2 to queue B, which preempts app 1's task containers. The capacity of queue B increases to 99%, but queue A shows 5000% used. -- This message was sent by Atlassian JIRA (v6.2#6252)
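The 5000% figure is consistent with charging app 1's surviving AM container against queue A's tiny absolute guarantee. A quick arithmetic check, assuming a 4GB cluster (the AM's 2GB is stated to be 50% of cluster memory):

```python
cluster_mb = 4096                        # assumed: 2GB AM = 50% of cluster
queue_a_guarantee = 0.01 * cluster_mb    # 1% capacity -> ~41MB absolute
am_container_mb = 2048                   # app 1's AM survives preemption

# A 2048MB container charged against a ~41MB guarantee.
used_pct = 100.0 * am_container_mb / queue_a_guarantee
```

So a single surviving 2GB AM in a 1% queue yields exactly the reported 5000% used, which also explains why the display problem does not actually require preemption: any container far larger than a queue's absolute guarantee produces the same ratio.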
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064483#comment-14064483 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], bq. Since every RMAppAttemptImpl object has a reference to an RMAppAttemptMetrics object, you are suggesting that I move the resource usage stats to RMAppAttemptMetrics. Yes. bq. Also, when reporting on resource usage, use the reporting methods from RMAppAttempt and RMApp. I'm not quite sure what the reporting methods are; they should be getRMAppAttemptMetrics on the attempt and getRMAppMetrics on the app. bq. You're suggestion is to keep resource usage stats only for running containers. Yes. bq. For completed containers, you are suggesting that the calculation be done for final resource usage stats within the RMContainerImpl#FinishTransition method and have that send the resource stats as a payload within the RMAppAttemptC ... No. You can update to the current trunk code and check RMContainerImpl#FinishedTransition#updateMetricsIfPreempted: you can change updateMetricsIfPreempted into something like updateAttemptMetrics, and create a new method in RMAppAttemptMetrics, like updateResourceUtilization. The benefit of doing it this way is that you don't need to send a payload to RMAppAttempt; all the information you need should already exist in RMContainer. Does this make sense to you? Please feel free to let me know if you have any questions.
Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
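The chargeback formula in the description is a straight sum over containers. A minimal sketch, where the per-container `(mem_mb, lifetime_s)` tuple input is illustrative (in YARN the terms would be aggregated from per-application container records):

```python
def memory_seconds(containers):
    """Compute the chargeback metric described above: the sum over all
    containers of reserved memory (MB) times lifetime (seconds),
    i.e. (reserved ram for container i * lifetime of container i),
    summed for i = 1..n. Uses reserved, not actually used, memory."""
    return sum(mem_mb * lifetime_s for mem_mb, lifetime_s in containers)

# Two 2048MB containers alive for 60s and 30s:
app_containers = [(2048, 60), (2048, 30)]
total = memory_seconds(app_containers)
```

Because the metric charges reserved rather than consumed memory, an app that held 2GB idle for an hour costs the same as one that used it fully, matching the "no one else could use it" rationale in the description.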
[jira] [Commented] (YARN-2285) Preemption can cause capacity scheduler to show 5,000% queue capacity.
[ https://issues.apache.org/jira/browse/YARN-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064492#comment-14064492 ] Wangda Tan commented on YARN-2285: -- I've verified this will still happen even if preemption is not enabled, both for the 5000% queue usage and for the above-100% root queue usage. Preemption can cause capacity scheduler to show 5,000% queue capacity. -- Key: YARN-2285 URL: https://issues.apache.org/jira/browse/YARN-2285 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Environment: Turn on CS Preemption. Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Minor Attachments: preemption_5000_percent.png I configure queues A and B to have 1% and 99% capacity respectively. There is no max capacity for either queue. Set a high user limit factor. Submit app 1 to queue A: its AM container takes 50% of cluster memory and its task containers take another 50%. Submit app 2 to queue B, which preempts app 1's task containers. It turns out the capacity of queue B increases to 99%, but queue A shows 5000% used.
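The 5,000% figure falls out of how used capacity is reported relative to the queue's guaranteed share rather than the cluster; a minimal sketch of the arithmetic (illustrative code, not scheduler internals):

```java
// Why the UI can show 5,000%: a queue's used capacity is reported as a
// percentage of its guaranteed capacity, not of the whole cluster.
// Numbers below are from the report: queue A is guaranteed 1% of the
// cluster but its AM container alone holds 50% of cluster memory after
// the task containers are preempted.
public class QueueUsageSketch {
    // Both arguments are percentages of total cluster resources.
    static double usedCapacityPercent(double usedPercentOfCluster,
                                      double guaranteedPercentOfCluster) {
        return 100.0 * usedPercentOfCluster / guaranteedPercentOfCluster;
    }

    public static void main(String[] args) {
        // 50% of the cluster used by a queue guaranteed only 1% -> 5000%.
        System.out.println(usedCapacityPercent(50.0, 1.0) + "%"); // 5000.0%
    }
}
```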
[jira] [Commented] (YARN-2257) Add user to queue mappings to automatically place users' apps into specific queues
[ https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064512#comment-14064512 ] Wangda Tan commented on YARN-2257: -- Hi [~sandyr], I agree we could use an existing library for queue rules. But I feel we'd better add a simple pattern-match mechanism like the %user-root.users.%user mapping I mentioned before, which will take reasonable effort but can cover more cases. Do you agree with that? Thanks, Add user to queue mappings to automatically place users' apps into specific queues -- Key: YARN-2257 URL: https://issues.apache.org/jira/browse/YARN-2257 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Patrick Liu Assignee: Vinod Kumar Vavilapalli Labels: features Currently, the fair-scheduler supports two modes: a default queue, or an individual queue for each user. Apparently, the default queue is not a good option, because resources cannot be managed per user or group. However, an individual queue for each user is not good enough either, especially when connecting YARN with Hive. There will be an increasing number of Hive users in a corporate environment, and if we create a queue per user, resource management becomes hard to maintain. I think the problem can be solved like this: 1. Define user-queue mappings in fair-scheduler.xml. Inside each queue, use aclSubmitApps to control users' ability to submit. 2. Each time a user submits an app to YARN, if the user is mapped to a queue, the app will be scheduled to that queue; otherwise, the app will be submitted to the default queue. 3. If the user cannot pass the aclSubmitApps limits, the app will not be accepted.
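The simple pattern-match mechanism discussed in the comment could look roughly like the following sketch. The rule representation, the "%user" placeholder handling, and all names are hypothetical, not the scheduler's actual configuration format:

```java
// Sketch of user->queue placement with a %user wildcard: explicit user
// rules are tried first, then a wildcard rule that expands %user into a
// per-user queue name, then the default queue. Illustrative only.
import java.util.LinkedHashMap;
import java.util.Map;

public class QueueMappingSketch {
    // Ordered rules: user name (or the literal "%user" wildcard) -> queue template.
    static String resolveQueue(Map<String, String> rules, String user, String defaultQueue) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (rule.getKey().equals(user) || rule.getKey().equals("%user")) {
                // Substitute the submitting user into the queue template.
                return rule.getValue().replace("%user", user);
            }
        }
        return defaultQueue;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("hive", "root.etl");          // explicit per-user mapping
        rules.put("%user", "root.users.%user"); // wildcard fallback
        System.out.println(resolveQueue(rules, "hive", "default"));  // root.etl
        System.out.println(resolveQueue(rules, "alice", "default")); // root.users.alice
    }
}
```

After placement, the queue's aclSubmitApps check (point 3 above) would still decide whether the app is actually accepted.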
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064749#comment-14064749 ] Wangda Tan commented on YARN-2305: -- Hi [~sunilg], Thanks for taking this issue. I think there are two issues in your screenshot: 1) Root queue usage above 100%. It is possible for a queue's used resource to be larger than its guaranteed resource because of container reservation. We may need to show reserved resource and used resource separately in our web UI. I encountered a similar problem in YARN-2285 too. 2) Total cluster memory shown on the web UI is different from CapacityScheduler.clusterResource. This seems like a new issue to me; the memory shown on the web UI is usedMemory + availableMemory of the root queue. I feel CSQueueUtils.updateQueueStatistics has some issues when we reserve a container in a LeafQueue. Hoping to hear more thoughts from your side. Thanks, Wangda When a container is in reserved state then total cluster memory is displayed wrongly. 
- Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map, reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state, total memory is shown as 15GB and used as 15GB (while total memory is 16GB)
[jira] [Created] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
Wangda Tan created YARN-2308: Summary: NPE happened when RM restart after CapacityScheduler queue configuration changed Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan I encountered an NPE during RM restart: {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And the RM fails to restart. 
This is caused by a queue configuration change: I removed some queues and added new ones. So when the RM restarts and tries to recover previous applications, an NPE is raised for any application whose queue has been removed.
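A minimal sketch of the failure mode and a graceful alternative: during recovery the scheduler looks up the application's queue by name, gets null because the queue was removed, and NPEs on the next dereference. All names below are illustrative, not the actual CapacityScheduler code:

```java
// Sketch: recovering an app whose queue no longer exists. Rejecting the
// app (or falling back to a default queue) avoids the NPE that currently
// fails the whole RM restart. Names are illustrative only.
import java.util.HashMap;
import java.util.Map;

public class QueueRecoverySketch {
    // Returns the queue name to place a recovered app in, or a rejection
    // marker instead of dereferencing a null queue.
    static String placeRecoveredApp(Map<String, String> queues, String requestedQueue) {
        String queue = queues.get(requestedQueue);
        if (queue == null) {
            // Queue was removed between RM restarts: handle gracefully.
            return "REJECTED: queue " + requestedQueue + " no longer exists";
        }
        return queue;
    }

    public static void main(String[] args) {
        Map<String, String> queues = new HashMap<>();
        queues.put("default", "default");
        System.out.println(placeRecoveredApp(queues, "removedQueue"));
        System.out.println(placeRecoveredApp(queues, "default"));
    }
}
```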
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065910#comment-14065910 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], Thanks for updating your patch. The failed test case should be unrelated to your changes; it is tracked by YARN-2270. Reviewing.. Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.patch 
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066145#comment-14066145 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], I've spent some time reviewing and thinking about the JIRA. I have a few comments: 1. Revert the changes to SchedulerAppReport; we have already changed ApplicationResourceUsageReport, and memory utilization should be part of the resource usage report. 2. Remove getMemory(VCore)Seconds from RMAppAttempt, and modify RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization. 3. Move {code} ._("Resources:", String.format("%d MB-seconds, %d vcore-seconds", app.getMemorySeconds(), app.getVcoreSeconds())) {code} from "Application Overview" to "Application Metrics", and rename it to "Resource Seconds". It should be considered part of the application metrics instead of the overview. 4. Change finishedMemory/VCoreSeconds to AtomicLong in RMAppAttemptMetrics so they can be accessed efficiently by multiple threads. 5. I think it's better to add a new method in SchedulerApplicationAttempt, like getMemoryUtilization, which will return only memory/cpu seconds. We do this to avoid locking the scheduling thread when showing application metrics on the web UI. getMemoryUtilization will be used by RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization, and by SchedulerApplicationAttempt#getResourceUsageReport as well. The MemoryUtilization class may contain two fields: runningContainerMemory(VCore)Seconds. 6. Since computing running-container resource utilization is not O(1) (we need to scan all containers under an application), I think it's better to cache a previously computed result and recompute it only after several seconds (maybe 1-3 seconds should be enough) have elapsed. You can also modify SchedulerApplicationAttempt#liveContainers to be a ConcurrentHashMap. 
With #6, getting memory utilization to show metrics on the web UI will not lock the scheduling thread at all. Please let me know if you have any comments here. Thanks, Wangda
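Suggestions #4-#6 above can be sketched together as follows; all class and field names are illustrative, not the committed patch:

```java
// Sketch of suggestions #4-#6: keep live containers in a ConcurrentHashMap
// so the web-UI thread can iterate without the scheduler lock, accumulate
// MB-seconds in an AtomicLong, and recompute the O(#containers) scan at
// most once per interval. Illustrative names only.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class CachedUtilizationSketch {
    static final long RECOMPUTE_INTERVAL_MS = 3000; // 1-3s suggested above

    static class RunningContainer {
        final long allocatedMb, startTimeMs;
        RunningContainer(long allocatedMb, long startTimeMs) {
            this.allocatedMb = allocatedMb;
            this.startTimeMs = startTimeMs;
        }
    }

    // Weakly consistent iteration; no scheduler lock needed (suggestion #6).
    final ConcurrentHashMap<String, RunningContainer> liveContainers = new ConcurrentHashMap<>();
    final AtomicLong cachedMemorySeconds = new AtomicLong(); // suggestion #4
    volatile long lastComputedAtMs = 0;

    // O(#containers) scan, performed at most once per interval; otherwise
    // the cached value is returned.
    long runningMemorySeconds(long nowMs) {
        if (nowMs - lastComputedAtMs >= RECOMPUTE_INTERVAL_MS) {
            long total = 0;
            for (RunningContainer c : liveContainers.values()) {
                total += c.allocatedMb * ((nowMs - c.startTimeMs) / 1000);
            }
            cachedMemorySeconds.set(total);
            lastComputedAtMs = nowMs;
        }
        return cachedMemorySeconds.get();
    }

    public static void main(String[] args) {
        CachedUtilizationSketch app = new CachedUtilizationSketch();
        app.liveContainers.put("c1", new RunningContainer(1024, 0));
        System.out.println(app.runningMemorySeconds(10_000)); // 10240 (computed)
        System.out.println(app.runningMemorySeconds(11_000)); // 10240 (cached)
    }
}
```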
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066154#comment-14066154 ] Wangda Tan commented on YARN-2305: -- Thanks for your elaboration; I understand now. I think this is an inconsistency between ParentQueue and LeafQueue; using clusterResource instead of allocated+available can definitely solve this problem. When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg 
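The inconsistency can be seen with a little arithmetic: deriving the total as used + available comes up short when a reservation is outstanding, because reserved memory is counted as neither. Numbers below are illustrative, not the exact values from the report:

```java
// Sketch of why used + available undercounts the cluster while a
// reservation is outstanding, and why reporting clusterResource directly
// avoids the problem. Illustrative numbers.
public class ClusterMemorySketch {
    // What a UI deriving "total" from used + available would show.
    static long derivedTotalMb(long clusterMb, long usedMb, long reservedMb) {
        // Reserved memory is neither "used" nor "available".
        long availableMb = clusterMb - usedMb - reservedMb;
        return usedMb + availableMb; // == clusterMb - reservedMb
    }

    public static void main(String[] args) {
        long clusterMb = 16 * 1024;  // 16 GB cluster
        long usedMb = 12 * 1024;     // 12 GB allocated
        long reservedMb = 2 * 1024;  // 2 GB reserved for a pending container
        System.out.println("derived total = " + derivedTotalMb(clusterMb, usedMb, reservedMb) / 1024 + " GB"); // 14 GB
        System.out.println("actual total  = " + clusterMb / 1024 + " GB"); // 16 GB
    }
}
```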
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066156#comment-14066156 ] Wangda Tan commented on YARN-2308: -- I think it should be doable; a missing application queue should not cause the RM to fail to start. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Priority: Critical 
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067542#comment-14067542 ] Wangda Tan commented on YARN-2008: -- Hi [~cwelch], Thanks for working on this patch. However, I've thought about this for a while, and I'm wondering if we should change this behavior. With preemption, we don't need to consider the used capacity of siblings or siblings of parents; the preemption policy will take care of over-used queues. In addition, even if preemption is disabled, the headroom should not change either (see next). If we define headroom as the maximum capacity an application can get, the formula headroom = min(userLimit, queue-max-cap) - consumed should be correct. But if we define headroom as the maximum *guaranteed* capacity an application can get, the formula should be changed to headroom = min(userLimit, queue-max-cap, queue-guaranteed-cap) - consumed. Does this make sense to you? Please let me know if you have any comments. Thanks, Wangda CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Chen He Attachments: YARN-2008.1.patch, YARN-2008.2.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster, and Q1 and Q2 each currently use 50% of the actual cluster's resources, there is no actual space available. If we use the current method to get headroom, the CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. 
Here is an example:

rootQueue
├── L1ParentQueue1 (allowed to use up to 80% of its parent)
│   ├── L2LeafQueue1 (50% of its parent)
│   └── L2LeafQueue2 (50% of its parent in minimum)
└── L1ParentQueue2 (allowed to use 20% in minimum of its parent)

When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of the actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 has used 40% of the rootQueue resources right now; in that case, L2LeafQueue2 can actually only use 30% (60%*50%).
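The two headroom formulas from the comment above can be compared side by side; this sketch reduces resources to a single memory dimension and uses illustrative numbers:

```java
// The two definitions discussed above, in one memory dimension:
//   maximum headroom:    min(userLimit, queueMaxCap) - consumed
//   guaranteed headroom: min(userLimit, queueMaxCap, queueGuaranteedCap) - consumed
public class HeadroomSketch {
    static long maxHeadroom(long userLimit, long queueMaxCap, long consumed) {
        return Math.min(userLimit, queueMaxCap) - consumed;
    }

    static long guaranteedHeadroom(long userLimit, long queueMaxCap,
                                   long queueGuaranteedCap, long consumed) {
        return Math.min(Math.min(userLimit, queueMaxCap), queueGuaranteedCap) - consumed;
    }

    public static void main(String[] args) {
        // Illustrative: queue guaranteed 30 and capped at 80 units,
        // user limit 60, 20 units already consumed.
        System.out.println("max headroom        = " + maxHeadroom(60, 80, 20));            // 40
        System.out.println("guaranteed headroom = " + guaranteedHeadroom(60, 80, 30, 20)); // 10
    }
}
```

The gap between the two values (30 units here) is exactly the resource that may be granted but later preempted.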
[jira] [Commented] (YARN-2297) Preemption can prevent progress in small queues
[ https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067544#comment-14067544 ] Wangda Tan commented on YARN-2297: -- bq. I feel this can create a little bit more starvation for queues configured with less capacity. +1, that would not be reasonable. bq. Yes. This make more sense, it can neutralize ratio as well as difference to a uniform way. I feel more sampling can be done to come with a better approach. i can check and update you. I feel it would be a better way too; looking forward to your update. We should make a fact-based decision :) Preemption can prevent progress in small queues --- Key: YARN-2297 URL: https://issues.apache.org/jira/browse/YARN-2297 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Priority: Critical Preemption can cause a hang issue in a single-node cluster: only AMs run, and no task container can run. h3. queue configuration Queues A/B have 1% and 99% capacity respectively. No max capacity. h3. scenario Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps and 1 user. Submit app 1 to queue A: its AM needs 2 GB, and there is 1 task that needs 2 GB, occupying the entire cluster. Submit app 2 to queue B: its AM needs 2 GB, and there are 3 tasks that need 2 GB each. Instead of all of app 1 being preempted, app 1's AM will stay and app 2's AM will launch. No task of either app can proceed. h3. 
commands /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Dmapreduce.map.memory.mb=2000 -Dyarn.app.mapreduce.am.command-opts=-Xmx1800M -Dmapreduce.randomtextwriter.bytespermap=2147483648 -Dmapreduce.job.queuename=A -Dmapreduce.map.maxattempts=100 -Dmapreduce.am.max-attempts=1 -Dyarn.app.mapreduce.am.resource.mb=2000 -Dmapreduce.map.java.opts=-Xmx1800M -Dmapreduce.randomtextwriter.mapsperhost=1 -Dmapreduce.randomtextwriter.totalbytes=2147483648 dir1 /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.map.memory.mb=2000 -Dyarn.app.mapreduce.am.command-opts=-Xmx1800M -Dmapreduce.job.queuename=B -Dmapreduce.map.maxattempts=100 -Dmapreduce.am.max-attempts=1 -Dyarn.app.mapreduce.am.resource.mb=2000 -Dmapreduce.map.java.opts=-Xmx1800M -m 1 -r 0 -mt 4000 -rt 0
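The hang in the scenario above is simple arithmetic: once both 2 GB AMs are admitted on the 4 GB node, no 2 GB task container can ever fit. A sketch:

```java
// Sketch of the deadlock arithmetic from the scenario: a 4 GB node with
// both AMs admitted (2 GB each) leaves no room for any 2 GB task.
public class AmDeadlockSketch {
    static boolean taskCanRun(long nodeMb, long committedMb, long taskMb) {
        return nodeMb - committedMb >= taskMb;
    }

    public static void main(String[] args) {
        long nodeMb = 4096, am1 = 2048, am2 = 2048, taskMb = 2048;
        long committed = am1 + am2;
        System.out.println("free = " + (nodeMb - committed) + " MB, task needs " + taskMb + " MB");
        System.out.println("task can run: " + taskCanRun(nodeMb, committed, taskMb)); // false
    }
}
```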
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068116#comment-14068116 ] Wangda Tan commented on YARN-796: - Really appreciate all your comments above. As Sandy, Alejandro and Allen mentioned, there are concerns about centralized configuration. My thinking is that node labels are more dynamic than any other existing NM options. An important use case we can see is that some customers want to put a label on each node indicating which department/team the node belongs to; when a new team comes in and new machines are added, labels may need to be changed. It is also possible that the whole cluster is booked to run some huge batch job at, for example, 12am-2am. So such labels will change frequently, and if we only have distributed configuration on each node, it is a nightmare for admins to re-configure. I think we should have the same internal interface for distributed/centralized configuration, like what we've done for RMStateStore. And as Jian Fang mentioned, bq. doubt about the assumption for admin to configure labels for a cluster. I think using a script to mark labels is a great way to save configuration work, but lots of other use cases need human intervention as well; good examples came from Allen and me. Thanks, Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068124#comment-14068124 ] Wangda Tan commented on YARN-796: - Hi Alejandro, I totally understand that the use case I mentioned is antithetical to the design philosophy of YARN, which is to elastically share resources in a multi-tenant environment. But hard partitioning has some important use cases, even if it is not strongly recommended, such as performance-sensitive environments. For example, a user may want to run HBase masters/region-servers on a group of nodes and not want any other tasks running on those nodes even if they have free resources. Our current queue configuration cannot solve such a problem. Of course the user can create a separate YARN cluster in this case, but I think keeping such NMs under the same RM is easier to use and manage. Do you agree? Thanks, Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch 
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068145#comment-14068145 ] Wangda Tan commented on YARN-796: - Alejandro, I think we've mentioned this in our design doc; you can check https://issues.apache.org/jira/secure/attachment/12654446/Node-labels-Requirements-Design-doc-V1.pdf, under top-level requirements - admin tools - Security and access controls for managing Labels. Please let me know if you have any comments on it. Thanks :), Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch 
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068147#comment-14068147 ] Wangda Tan commented on YARN-1198: -- I've just taken a look at all sub-tasks of this JIRA, and I'm wondering if we should first define what the headroom is. Previously in YARN, including YARN-1198, headroom was defined as the maximum resource an application can get. And in YARN-2008, headroom is defined as the available resource an application can get, because we already consider the used resources of sibling queues. I wonder if we need to add a new field, like a "guaranteed headroom" of an application, which considers its absolute capacity (not maximum capacity), user limits, etc. We may want to keep both of them, because: - The maximum resource is not always achievable, since the sum of the maximum resources of leaf queues may exceed the cluster resource. - With preemption, resource beyond the guaranteed resource will likely be preempted; it should be considered temporary resource. And with this, an AM can: - Use the guaranteed headroom to allocate resources that will not be preempted. - Use the maximum headroom to try to allocate resources beyond its guaranteed headroom. In my humble opinion, "the available resource an application can get" doesn't make a lot of sense here, and may cause some backward-compatibility problems as well: in a dynamic cluster the number can change rapidly, and it is possible that the cluster is filled by another application one second after the AM gets the available headroom. This field cannot solve the deadlock problem either; a malicious application can ask for much more resource than this, or a careless developer can simply ignore the field. The only valid solution in my head is putting such logic on the scheduler side and enforcing resource usage via the preemption policy. Any thoughts? 
[~jlowe], [~cwelch] Thanks, Wangda Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1198.1.patch Today, headroom calculation (for the app) takes place only when: * A new node is added to/removed from the cluster * A new container is getting assigned to the application. However, there are potentially many situations which are not considered in this calculation: * If a container finishes, then the headroom for that application will change and should be reported to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue, then: ** If app1's container finishes, not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly, if a container is assigned to either application app1/app2, both AMs should be notified about their headroom. ** To simplify the whole communication process, it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted in the same queue). * If a new user submits an application to the queue, then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..). * Also, when an admin user refreshes the queue, headroom has to be updated. These are all potential bugs in headroom calculations.
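How an AM might use the two proposed headroom values from the comment above can be sketched as follows; the method and field names are hypothetical, not part of any YARN API:

```java
// Sketch of the proposed two-headroom model from the comment above:
// allocate preemption-safe work within the guaranteed headroom, and
// opportunistic work up to the maximum headroom. Illustrative names only.
public class AmAllocationSketch {
    static String planRequest(long guaranteedHeadroomMb, long maxHeadroomMb, long wantMb) {
        if (wantMb <= guaranteedHeadroomMb) {
            return "safe";          // within guaranteed share; will not be preempted
        } else if (wantMb <= maxHeadroomMb) {
            return "opportunistic"; // beyond guarantee; may be preempted later
        }
        return "defer";             // exceeds even the maximum headroom
    }

    public static void main(String[] args) {
        System.out.println(planRequest(2048, 8192, 1024)); // safe
        System.out.println(planRequest(2048, 8192, 4096)); // opportunistic
        System.out.println(planRequest(2048, 8192, 9000)); // defer
    }
}
```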
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068148#comment-14068148 ] Wangda Tan commented on YARN-2008: -- Hi [~cwelch], [~airbots], I've put my comment on YARN-1198: https://issues.apache.org/jira/browse/YARN-1198?focusedCommentId=14068147&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14068147, because I think it is a general comment about headroom. Please share your ideas there. Thanks, Wangda CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Chen He Attachments: YARN-2008.1.patch, YARN-2008.2.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 each currently use 50% of the actual cluster resources and there is no actual space available. With the current method of computing the headroom, the CapacityScheduler thinks there are still resources available for users in Q1, but they have already been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
{code}
                     rootQueue
                    /         \
      L1ParentQueue1           L1ParentQueue2
 (up to 80% of its parent)     (at least 20% of its parent)
      /            \
L2LeafQueue1    L2LeafQueue2
(50% of its     (at least 50%
 parent)         of its parent)
{code}
When we calculate the headroom of a user in L2LeafQueue2, the current method thinks L2LeafQueue2 can use 40% (80% * 50%) of the actual rootQueue resources. However, without checking L1ParentQueue1's sibling we cannot be sure. It is possible that L1ParentQueue2 has used 40% of the rootQueue resources right now; in that case, L2LeafQueue2 can actually only use 30% (60% * 50%).
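The 40% vs. 30% discrepancy in the example above can be reproduced with a small model (fractions of cluster resource; the method names are illustrative, not CapacityScheduler internals):

```java
// Sketch of the two ways to bound L2LeafQueue2's usable share of the cluster.
public class QueueMaxCapSketch {

    // Current (incorrect) method: multiply configured max capacities
    // down the queue path, ignoring what siblings actually use.
    public static double naiveMaxCap(double parentMax, double leafMax) {
        return parentMax * leafMax;
    }

    // Hierarchy-aware bound: the parent can only use what its sibling
    // has left of the cluster, so take the min before multiplying down.
    public static double actualMaxCap(double parentMax, double siblingUsed, double leafMax) {
        double parentAvailable = Math.min(parentMax, 1.0 - siblingUsed);
        return parentAvailable * leafMax;
    }

    public static void main(String[] args) {
        // L1ParentQueue1 max = 80%, L2LeafQueue2 = 50% of its parent,
        // L1ParentQueue2 currently uses 40% of the cluster.
        System.out.println(naiveMaxCap(0.8, 0.5));       // 0.4 (40%)
        System.out.println(actualMaxCap(0.8, 0.4, 0.5)); // 0.3 (30%)
    }
}
```

The fix sketched in the patch amounts to walking up the hierarchy and clamping each level's max capacity by what is actually still available at that level.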
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068163#comment-14068163 ] Wangda Tan commented on YARN-796: - Hi [~sunilg], bq. 2. Regarding reservations, how about introducing node-label reservations. Ideas is like, if an application is lacking resource on a node, it can reserve on that node as well as to node-label. So when a suitable node update comes from another node in same node-label, can try allocating container in new node by unreserving from old node. I think this makes sense; we'd better support it. I will check how our current resource reservation/unreservation logic can support this, and will keep you posted. bq. 3. My approach was more like have a centralized configuration, but later after some time, if want to add a new node to cluster, then it can start with a hardcoded label in its yarn-site. In your approach, we need to use REStful API or admin command to bring this node under one label. May be while start up itself this node can be set under a label. your thoughts? One problem I can see with a mixed centralized/distributed configuration is that it will be hard to manage labels after an RM/NM restart -- should we use the labels specified in the NM config or those in our centralized config? I also replied to Jian Fang about this previously: https://issues.apache.org/jira/browse/YARN-796?focusedCommentId=14063316&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14063316. Maybe a workaround is to define that the centralized config always overwrites the distributed config. E.g., a user defines GPU in the NM config, the admin adds FPGA via the RESTful API, and the RM serializes both GPU and FPGA into a centralized storage system. After an RM or NM restart, the RM will ignore the NM config if anything is defined in the RM. But I still think it's better to avoid using both of them together. 
Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068184#comment-14068184 ] Wangda Tan commented on YARN-796: - bq. You can solve this problem today by just running separate RMs. I think that is not good for configuration: users need to maintain several configuration folders on their nodes for job submission. bq. In practice, however, marking nodes for specific teams in queue systems doesn't work because doing so assumes that the capacity never changes... i.e., it is possible that you cannot replace a failed node with a random node in a heterogeneous cluster. E.g., only some nodes have GPUs, and these nodes will be dedicated to the data scientist team. A percentage of queue capacity doesn't make a lot of sense here. bq. ... except, you guessed it: this is a solved problem today too. You just need to make sure the container sizes that are requested consume the whole node. Assume an HBase master wants to run on a node that has 64G of memory and InfiniBand. You can ask for a 64G memory container, but it is likely to be allocated to a 128G node that doesn't have InfiniBand. Again, this is another heterogeneity issue. And asking for such a big container may take a great amount of time, waiting for resource reservation, etc. bq. it still wouldn't be a nightmare because any competent admin would use configuration management to roll out changes to the nodes in a controlled manner. It is very likely that not every admin has scripts like yours, especially new YARN users; we'd better make this feature usable out of the box.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068248#comment-14068248 ] Wangda Tan commented on YARN-796: - Allen, I think what we were just talking about is how to support the hard-partition use case in YARN, weren't we? I'm surprised to get a -1 here; nobody has ever said that dynamic labeling from the NM will not be supported.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069615#comment-14069615 ] Wangda Tan commented on YARN-796: - Hi Tucu, Thanks for providing thoughts about how to stage the development work. It's reasonable, and we're trying to scope the work for a first shot as well. Will keep you posted. Thanks, Wangda
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069619#comment-14069619 ] Wangda Tan commented on YARN-796: - Jian Fang, I think it makes sense for the RM to have a global picture, because we can prevent typos created by admins manually filling labels into the NM config, etc. On the other hand, I think your use case is also reasonable; we'd better support both of them, as well as the OR label expression. Will keep you posted when we've made a plan. Thanks, Wangda
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069651#comment-14069651 ] Wangda Tan commented on YARN-1198: -- I agree with [~jlowe], [~airbots] and [~cwelch]: used resource should be considered in the headroom (which is YARN-2008). And apparently, the application master can ask for more than that number to possibly get more resource. I completely agree with what Jason mentioned: ignoring the headroom will not cause problems for anything except the application itself. What I originally wanted to say is that putting headroom and gang scheduling together can cause a deadlock problem, which should be solved on the scheduler side. But that seems kind of off-topic, so let's set it aside here. Also, as Chen mentioned, we don't need to consider preemption when computing headroom. Besides, when resource is about to be preempted from an app, the AM will receive messages about the preemption requests and should handle them itself. Thanks, Wangda
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072696#comment-14072696 ] Wangda Tan commented on YARN-415: - Hi Eric, Thanks for updating your patch. I don't have major comments now. *Following are some minor comments:* 1) RMAppAttemptImpl.java 1.1 There are some irrelevant line changes in RMAppAttemptImpl, could you please revert them? Like
{code}
 RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition())
-
+
{code}
1.2 getResourceUtilization:
{code}
+if (rmApps != null) {
+  RMApp app = rmApps.get(attemptId.getApplicationId());
+  if (app != null) {
{code}
I think these two cannot be null here; we don't need the null checks to guard against a potential bug.
{code}
+ ApplicationResourceUsageReport appResUsageRpt =
{code}
It's better to name it appResUsageReport, since rpt is not a common abbreviation of report. 2) RMContainerImpl.java 2.1 updateAttemptMetrics:
{code}
if (rmApps != null) {
  RMApp rmApp = rmApps.get(container.getApplicationAttemptId().getApplicationId());
  if (rmApp != null) {
{code}
Again, I think the two null checks are unnecessary. 3) SchedulerApplicationAttempt.java 3.1 Some rename suggestions (please let me know if you have a better idea): CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS, lastTime -> lastMemoryUtilizationUpdateTime, cachedMemorySeconds -> lastMemorySeconds, same for cachedVCore ... 4) AppBlock.java Should we rename Resource Seconds: to Resource Utilization or something? 5) Test 5.1 I'm wondering if we need to add an end-to-end test, since we changed RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt. It could consist of submitting an application, launching several containers, and finishing the application. It's better to make the launched application contain several application attempts. While the application is running, there are multiple containers running and multiple containers finished; we can check whether the total resource utilization is as expected. *To your comments:* 1) bq. 
One thing I did notice when these values are cached is that there is a race where containers can get counted twice: I think this cannot be entirely avoided; it should be a transient state, and Jian He and I discussed this a long time ago. But apparently, a 3-second cache makes it more than just a transient state. I suggest you make lastTime in SchedulerApplicationAttempt protected, and in FiCaSchedulerApp/FSSchedulerApp, when removing a container from liveContainers (in the completedContainer method), set lastTime to a negative value like -1; the next time we try to get the accumulated resource utilization, it will recompute all container utilization. 2) bq. I am a little reluctant to modify the type of SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems like something that could be done separately. I think that will be fine :), because the current getRunningResourceUtilization is called by getResourceUsageReport, and getResourceUsageReport is synchronized; whether or not we change liveContainers to a concurrent map, we cannot solve the locking problem. I agree with enhancing it in a separate JIRA in the future. Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. 
To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server).
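The chargeback formula in the description, the sum over containers of (reserved ram * lifetime), can be sketched as follows (an illustrative helper, not the actual YARN-415 patch code):

```java
// Minimal sketch of the MB-seconds chargeback formula:
// memorySeconds = sum_i (reserved memory of container i in MB
//                        * lifetime of container i in seconds).
public class MemorySecondsSketch {

    public static long memorySeconds(long[] reservedMb, long[] lifetimeSec) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            total += reservedMb[i] * lifetimeSec[i]; // MB * seconds
        }
        return total;
    }

    public static void main(String[] args) {
        // container 1: 1024 MB reserved for 60 s; container 2: 2048 MB for 30 s
        System.out.println(memorySeconds(new long[]{1024, 2048}, new long[]{60, 30})); // 122880
    }
}
```

Note the unit is based on reserved (not actually used) memory, matching the reporter's rationale that reserved-but-idle memory was still unavailable to everyone else.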
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073899#comment-14073899 ] Wangda Tan commented on YARN-2308: -- [~lichangleo], thanks for working on it! Looking forward to your patch. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical I encountered an NPE when the RM restarted
{code}
2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
    at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:744)
{code}
And the RM fails to restart. This is caused by a queue configuration change: I removed some queues and added new queues. So when the RM restarts, it tries to recover historical applications, and when any queue of these applications has been removed, an NPE is raised.
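The failure mode above is a null queue lookup during recovery. A minimal sketch of the kind of guard that avoids the crash is below (illustrative only; a real fix would reject or fail the recovered application through the RM's normal event flow rather than just return a boolean):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: guard application recovery against a queue that was
// removed by a configuration change, instead of letting the null queue
// reference propagate into an NPE that kills the RM restart.
public class QueueRecoverySketch {

    private final Map<String, Object> queues = new HashMap<>();

    public void addQueue(String name) {
        queues.put(name, new Object());
    }

    // Returns true if the recovered app can be placed in its queue;
    // false (with a diagnostic) when the queue no longer exists.
    public boolean recoverApp(String appId, String queueName) {
        Object queue = queues.get(queueName);
        if (queue == null) {
            System.err.println("Rejecting recovered app " + appId
                + ": queue '" + queueName + "' no longer exists");
            return false;
        }
        return true;
    }
}
```

The essential point is that a missing queue during recovery is an expected condition after a config change, not an internal error, so it should be handled explicitly.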
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074094#comment-14074094 ] Wangda Tan commented on YARN-415: - Hi Eric, Thanks for updating your patch again. *To your comments,* bq. I was able to remove the rmApps variable, but I had to leave the check for app != null because if I try to take that out, several unit tests would fail with NullPointerException. Even with removing the rmApps variable, I needed to change TestRMContainerImpl.java to mock rmContext.getRMApps(). I would suggest fixing such UTs instead of inserting kernel code just to make the UTs pass. I'm not sure about the effort of doing this; if the effort is still reasonable, we should do it. bq. I'm still working on the unit tests as you suggested, but I wanted to get the rest of the patch up first so you can look at it No problem :), I can give some reviews of your existing changes. *I've reviewed some details of your patch; one very minor comment:* ApplicationCLI.java
{code}
+ appReportStr.print("\tResources used : ");
{code}
Should we change it to Resource Utilization as well? Other than that, the patch almost LGTM; looking forward to your new patch containing an integration test. 
Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API.
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074249#comment-14074249 ] Wangda Tan commented on YARN-2069: -- Hi [~mayank_bansal], Thanks for working on this again. I've taken a brief look at your patch; I think the general approach in your patch is: - Compute a target-user-limit for a given queue, - Preempt containers according to a user's current consumption and the target-user-limit, - If more resource needs to be preempted, consider preempting AM containers. I think there are a couple of rules we need to respect (please let me know if you don't agree with any of them): # Used resource of users in a queue after preemption should be as even as possible # Before we start preempting AM containers, all task containers should be preempted (according to YARN-2022, preempting AM containers should have the least priority) # If we have to preempt AM containers, we should respect #1 too. For #1, if we want to quantify the goal, it is:
{code}
Let rp_i = used-resource-after-preemption of user_i, for i in {users}

minimize  sqrt( sum_i ( rp_i - (sum_j rp_j) / #users )^2 )
{code}
In other words, we should minimize the standard deviation of used-resource-after-preemption. Since not all containers are equal in size, it is possible that the used-resource-after-preemption of a given user cannot precisely equal the target-user-limit. In our current logic, we will make used-resource-after-preemption <= target-user-limit. 
Consider the following example:
{code}
qA has users {V, W, X, Y, Z}; each user has one application
V: app5: {4, 4, 4, 4}   // V has 4 containers, each with memory=4G, minimum_allocation=1G
W: app4: {4, 4, 4, 4}
X: app3: {4, 4, 4, 4}
Y: app2: {4, 4, 4, 4, 4, 4}
Z: app1: {4}
target-user-limit=11, resource-to-obtain=23

After preemption:
V: {4, 4}
W: {4, 4}
X: {4, 4}
Y: {4, 4, 4, 4, 4, 4}
Z: {4}
{code}
This imbalance happens because each application we preempt from may overshoot the user-limit (bias), and the more users we process, the more accumulated bias we can get. In other words, the imbalance is linearly correlated with number-of-users-in-a-queue times average-container-size. And we cannot solve this problem by preempting from the user with the most usage; with the same example:
{code}
qA has users {V, W, X, Y, Z}; each user has one application
V: app5: {4, 4, 4, 4}
W: app4: {4, 4, 4, 4}
X: app3: {4, 4, 4, 4}
Y: app2: {4, 4, 4, 4, 4, 4}
Z: app1: {4}
target-user-limit=11, resource-to-obtain=23

After preemption (from the user with the most usage, the sequence is Y-V-W-X-Z):
V: {4, 4}
W: {4, 4, 4, 4}
X: {4, 4, 4, 4}
Y: {4, 4}
Z: {4}
{code}
Still not very balanced; the ideal result would be:
{code}
V: {4, 4, 4}
W: {4, 4, 4}
X: {4, 4, 4}
Y: {4, 4, 4}
Z: {4}
{code}
In addition, this approach cannot satisfy rules #2/#3 either if the target-user-limit is not appropriately computed. So I propose to do it another way: we should recompute (used-resource - marked-preempted-resource) for a user every time after deciding to preempt each container. Maybe we can use a priority queue here, keyed by (used-resource - marked-preempted-resource). And then we don't need to compute a target user limit at all. 
The pseudo code for preempting resource of a queue might look like:
{code}
compute resToObtain first;

// first preempt task containers
while (resToObtain > 0 && task containers remain) {
  pick the user-x with the largest (used-resource - marked-preempted-resource)
  pick one container-y from user-x to preempt
  resToObtain -= container-y.resource
}
if (resToObtain <= 0) {
  return;
}
// if more resource needs to be preempted, preempt AM containers
while (resToObtain > 0 && total-am-resource - marked-preempted-am-resource > max-am-percentage) {
  // do the same thing again:
  pick the user-x with the largest (used-resource - marked-preempted-resource)
  pick one container-y from user-x to preempt
  resToObtain -= container-y.resource
}
{code}
With this, we can make the imbalance linearly correlated with average-container-size only, and it addresses rules #2/#3 that we should respect, as mentioned before, altogether. Mayank, do you think this is a reasonable suggestion? Any other thoughts? [~vinodkv], [~curino], [~sunilg]. Thanks, Wangda CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch,
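The task-container part of the pseudo code above can be sketched concretely as follows (an illustrative model with scalar container sizes; AM containers and the max-am-percentage check are omitted for brevity, and the user/container names mirror the 5-user example):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed loop: repeatedly preempt one container from the
// user with the largest (used-resource - marked-preempted-resource),
// recomputing after every pick, so no target user limit is needed.
public class BalancedPreemptionSketch {

    // user -> remaining (not yet marked-preempted) container sizes
    public static void preempt(Map<String, Deque<Integer>> users, int resToObtain) {
        while (resToObtain > 0) {
            // pick the user with the largest remaining usage
            String top = null;
            int topUsed = 0;
            for (Map.Entry<String, Deque<Integer>> e : users.entrySet()) {
                int used = e.getValue().stream().mapToInt(Integer::intValue).sum();
                if (used > topUsed) { topUsed = used; top = e.getKey(); }
            }
            if (top == null) break; // no task containers left to preempt
            resToObtain -= users.get(top).removeLast(); // mark one container preempted
        }
    }

    // Runs the example from the comment (V/W/X: 4x4G, Y: 6x4G, Z: 1x4G,
    // resource-to-obtain = 23); returns remaining container counts per user.
    public static int[] runExample() {
        Map<String, Deque<Integer>> users = new HashMap<>();
        String[] names = {"V", "W", "X", "Y", "Z"};
        int[] containerCounts = {4, 4, 4, 6, 1};
        for (int i = 0; i < names.length; i++) {
            Deque<Integer> d = new ArrayDeque<>();
            for (int j = 0; j < containerCounts[i]; j++) d.add(4);
            users.put(names[i], d);
        }
        preempt(users, 23);
        int[] remaining = new int[names.length];
        for (int i = 0; i < names.length; i++) remaining[i] = users.get(names[i]).size();
        return remaining;
    }
}
```

Running the example leaves V, W, X and Y each with three 4G containers and Z untouched, i.e. the "ideal" balanced outcome from the comment, because usage is re-ranked after every single preempted container.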
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075220#comment-14075220 ] Wangda Tan commented on YARN-2069: -- Hi Mayank, Thanks for your detailed explanation; I think I understood your approach. However, I think the current way of computing the target user limit is not correct. Let me explain: basically, your new {{computeTargetedUserLimit}} is modified from {{computeUserLimit}}, and it calculates the following:
{code}
target_capacity = used_capacity - resToObtain
target_user_limit =
    min(max(target_capacity / #active_user,
            target_capacity * user_limit_percent),
        target_capacity * user_limit_factor)
{code}
So when user_limit_percent is set to the default (100%), it is possible that target_user_limit * #active_user > queue_max_capacity. In this case, every user's usage can be below target_user_limit while the usage of the queue is still larger than the guaranteed resource. Let me give you an example:
{code}
Assume queue capacity = 50, used_resource = 70, resToObtain = 20
So target_capacity = 50, and there are 5 users in the queue
user_limit_percent = 100%, user_limit_factor = 1 (both are defaults)
So target_user_capacity = min(max(50 / 5, 50 * 100%), 50) = 50
User1 used 20
User2 used 10
User3 used 10
User4 used 20
User5 used 10
So all users' used capacities are < target_user_capacity
{code}
In the existing logic of {{balanceUserLimitsinQueueForPreemption}}:
{code}
if (Resources.lessThan(rc, clusterResource, userLimitforQueue, userConsumedResource)) {
  // do preemption
} else {
  continue;
}
{code}
If a user's used resource < target_user_capacity, it will not be preempted. Mayank, is that correct? Or did I misunderstand your logic? 
Please let me know your comments, Thanks, Wangda CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch This is different from (even if related to, and likely sharing code with) YARN-2113. YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities.
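The target-user-limit arithmetic from the example above can be reproduced directly (illustrative; not the actual {{computeTargetedUserLimit}} code):

```java
// Sketch of the target-user-limit computation under discussion,
// with capacities modeled as plain numbers for clarity.
public class TargetUserLimitSketch {

    public static double targetUserLimit(double usedCapacity, double resToObtain,
            int activeUsers, double userLimitPercent, double userLimitFactor) {
        double targetCapacity = usedCapacity - resToObtain;
        return Math.min(
            Math.max(targetCapacity / activeUsers, targetCapacity * userLimitPercent),
            targetCapacity * userLimitFactor);
    }

    public static void main(String[] args) {
        // queue used 70, resToObtain 20, 5 active users,
        // defaults: user_limit_percent = 100%, user_limit_factor = 1
        System.out.println(targetUserLimit(70, 20, 5, 1.0, 1.0)); // 50.0
        // 50 exceeds every user's usage (10..20), so under the patch's
        // per-user check, nothing would be preempted.
    }
}
```

This makes the flaw concrete: with the default 100% user-limit-percent the per-user limit degenerates to the whole target capacity, so the per-user comparison can never trigger preemption even though the queue as a whole is over its guarantee.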
[jira] [Commented] (YARN-2362) Capacity Scheduler: apps with requests that exceed current capacity can starve pending apps
[ https://issues.apache.org/jira/browse/YARN-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075970#comment-14075970 ] Wangda Tan commented on YARN-2362: -- I think we should fix this:
{code}
 if (!assignToQueue(clusterResource, required)) {
-  return NULL_ASSIGNMENT;
+  break;
 }
{code}
The {{return NULL_ASSIGNMENT}} statement means: if an app submitted earlier cannot allocate resource in the queue, none of the remaining apps in the queue can allocate resource either. The {{break}} looks better to me. And I agree this should be a duplicate of YARN-1631. Capacity Scheduler: apps with requests that exceed current capacity can starve pending apps --- Key: YARN-2362 URL: https://issues.apache.org/jira/browse/YARN-2362 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.4.1 Reporter: Ram Venkatesh Cluster configuration: Total memory: 8GB yarn.scheduler.minimum-allocation-mb 256 yarn.scheduler.capacity.maximum-am-resource-percent 1 (100%, test-only config) App 1 makes a request for 4.6 GB, succeeds, and the app transitions to the RUNNING state. It subsequently makes a request for 4.6 GB, which cannot be granted, and it waits. App 2 makes a request for 1 GB, never receives it, and so stays in the ACCEPTED state forever. I think this can happen in leaf queues that are near capacity. The fix is likely in LeafQueue.java assignContainers near line 861, where it returns if the assignment would exceed queue capacity, instead of checking whether requests from other active applications can be met.
{code:title=LeafQueue.java|borderStyle=solid}
 // Check queue max-capacity limit
 if (!assignToQueue(clusterResource, required)) {
-  return NULL_ASSIGNMENT;
+  break;
 }
{code}
With this change, the scenario above allows App 2 to start and finish while App 1 continues to wait. I have a patch available, but I am wondering if the current behavior is by design.
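The starvation effect discussed above can be illustrated with a toy assignment loop (not LeafQueue's real code; the actual patch replaces {{return NULL_ASSIGNMENT}} with {{break}}, which here is modeled simply as skipping the app whose request doesn't fit instead of aborting the whole pass):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model contrasting the two behaviors: aborting on the first app whose
// request doesn't fit (old behavior, starves all later apps) vs. skipping
// that app so later, smaller requests can still be assigned.
public class AssignLoopSketch {

    public static List<String> assign(int queueFree, int[] requests, boolean abortOnFirstFailure) {
        List<String> assigned = new ArrayList<>();
        for (int i = 0; i < requests.length; i++) {
            if (requests[i] > queueFree) {
                if (abortOnFirstFailure) {
                    return assigned; // old behavior: app i+1..n never considered
                }
                continue; // skip only this app; keep scanning later apps
            }
            queueFree -= requests[i];
            assigned.add("app" + (i + 1));
        }
        return assigned;
    }

    public static void main(String[] args) {
        // 4 GB free; app1 wants 5 GB (doesn't fit), app2 wants 1 GB
        System.out.println(assign(4, new int[]{5, 1}, true));  // []
        System.out.println(assign(4, new int[]{5, 1}, false)); // [app2]
    }
}
```

This mirrors the JIRA scenario: with the abort-on-first-failure behavior, App 2's 1 GB request is never even examined while App 1's oversized request is pending.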
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076042#comment-14076042 ] Wangda Tan commented on YARN-1707: -- Thanks for uploading the patch [~curino], [~subru]. These are great additions to the current CapacityScheduler. I took a look at your patch. *First I have a couple of questions about its background, especially {{PlanQueue}}/{{ReservationQueue}} in this patch. I think understanding the background is important for me to get the whole picture of this patch. What I understand is:* # {{PlanQueue}} can have a normal {{ParentQueue}} as its parent, but all children of a {{PlanQueue}} can only be {{ReservationQueue}}s. Is it possible for multiple {{PlanQueue}}s to exist in the cluster? # {{PlanQueue}} is initially set up in configuration, the same as {{ParentQueue}}; it has absolute capacity, etc. But unlike {{ParentQueue}}, it also has user-limit/user-limit-factor, etc. # {{ReservationQueue}} is dynamically initialized by the PlanFollower; when a new reservationId is acquired, it creates a new {{ReservationQueue}} accordingly # {{PlanFollower}} can dynamically adjust the queue sizes of {{ReservationQueue}}s so that resource reservations can be satisfied. # Is it possible that the sum of reserved resources exceeds the limit of the {{PlanQueue}}/{{ReservationQueue}} and preemption is triggered? # How do we deal with RM restart? It is possible that the RM restarts during a resource reservation; we may need to consider how to persist such queues Hope you could share your ideas about these. *For the requirements of this ticket (copied from the JIRA),* # create queues dynamically # destroy queues dynamically # dynamically change queue parameters (e.g., capacity) # modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% # move app across queues I found that #1-#3 are used only by {{PlanQueue}}, {{Reservation}}.
IMHO, it would be better to add them to the CapacityScheduler without coupling them to the ReservationSystem, but I cannot think of other solid scenarios that could leverage them. I hope to get feedback from the community before we couple them with the ReservationSystem. And as mentioned by [~acmurthy], can we merge this with the existing add-new-queue mechanism? #4 should only be valid in {{PlanQueue}}, because if we change this behavior in {{ParentQueue}}, a careless admin might mis-set the capacities of queues under a parent queue; if the sum of their capacities doesn't equal 100%, some resources may not be usable by applications. *Some other comments (mainly about moving apps, because we may need to consider the scope of create/destroy queues first):* 1) I think we need to consider how moving apps across queues works with YARN-1368. We can change the queue of containers from queueA to queueB, but with YARN-1368, during RM restart a container will report that it is in queueA (we don't sync the move to the NM when doing a moveApp operation). I hope [~jianhe] could share some thoughts about this as well. 2) Moving an application in the CapacityScheduler needs to call finishApplication on the source queue and submitApplication on the target queue to keep QueueMetrics correct. And submitApplication will check the ACL of the target queue as well. 3) Should we respect MaxApplicationsPerUser in the target queue when trying to move an app? IMHO, we can stop moving the app if MaxApplicationsPerUser is reached in the target queue. Thanks, Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it.
Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076062#comment-14076062 ] Wangda Tan commented on YARN-2008: -- Hi Craig, As we discussed in YARN-1198, I think we should consider the resources used by a queue's siblings when computing headroom. I took a look at your patch again; some comments: We first need to think about how to calculate headroom in general. I think the headroom is (concluded from the sub-JIRAs of YARN-1198): {code} queue_available = min(clusterResource - used_by_sibling_of_parents - used_by_this_queue, queue_max_resource) headroom = min(queue_available - available_resource_in_blacklisted_nodes, user_limit) {code} So I think this JIRA focuses on computing {{used_by_sibling_of_parents}}, right? The general approach looks good to me, except in CSQueueUtils.java (will include review of the tests in the next iteration): 1) {code} //sibling used is parent used - my used... float siblingUsedCapacity = Resources.ratio( resourceCalculator, Resources.subtract(parent.getUsedResources(), queue.getUsedResources()), parentResource); {code} It seems to me this computation is not robust enough when the parent resource is empty, whether it's a zero-capacity queue or its siblings used 100% of the cluster. It's better to add an edge-case test to prevent such zero-division as well. 2) It's better to explicitly cap {{return absoluteMaxAvail}} to the range \[0~1\] to prevent floating-point errors. Thanks, Wangda CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch Suppose there are two queues, both allowed to use 100% of the actual resources in the cluster.
Q1 and Q2 each currently use 50% of the actual cluster's resources, so there is no actual space available. If we use the current method to get headroom, the CapacityScheduler thinks there are still available resources for users in Q1, but they have been used by Q2. If the CapacityScheduler has a hierarchical queue structure, it may report an incorrect queueMaxCap. Here is an example:
{code}
                         rootQueue
                        /         \
         L1ParentQueue1            L1ParentQueue2
  (allowed to use up to 80%        (allowed to use 20% in
   of its parent)                   minimum of its parent)
        /            \
L2LeafQueue1       L2LeafQueue2
(50% of its        (50% of its parent
 parent)            in minimum)
{code}
When we calculate the headroom of a user in L2LeafQueue2, the current method will think L2LeafQueue2 can use 40% (80%*50%) of the actual rootQueue resources. However, without checking L1ParentQueue1, we cannot be sure. It is possible that L1ParentQueue2 has used 40% of the rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
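The headroom formula quoted in the comment can be sketched as a standalone helper. The class and method names below are illustrative, not the actual CSQueueUtils API; all quantities are in MB:

```java
// Illustrative sketch of the headroom formula from the comment above;
// not the actual CSQueueUtils code. All quantities are in MB.
public class HeadroomSketch {

  /** Guarded ratio, so an empty parent resource cannot cause a zero-division. */
  public static float ratio(long numerator, long denominator) {
    return denominator == 0 ? 0f : (float) numerator / denominator;
  }

  public static long headroom(long clusterResource,
                              long usedBySiblingsOfParents,
                              long usedByThisQueue,
                              long queueMaxResource,
                              long blacklistedAvailable,
                              long userLimit) {
    // queue_available = min(cluster - used_by_sibling_of_parents - used_by_this_queue,
    //                       queue_max_resource)
    long queueAvailable = Math.min(
        clusterResource - usedBySiblingsOfParents - usedByThisQueue,
        queueMaxResource);
    // headroom = min(queue_available - blacklisted_available, user_limit)
    long h = Math.min(queueAvailable - blacklistedAvailable, userLimit);
    return Math.max(h, 0); // cap at 0, analogous to capping absoluteMaxAvail to [0,1]
  }
}
```

Subtracting `used_by_sibling_of_parents` is what accounts for L1ParentQueue2's usage in the example above, so L2LeafQueue2's headroom drops from 40% to 30%.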
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077254#comment-14077254 ] Wangda Tan commented on YARN-1707: -- Hi [~subru], Thanks for your elaboration, it is very helpful for me to understand the background. Regards, Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.patch The CapacityScheduler is a rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this require the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity())= 100% instead of ==100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077262#comment-14077262 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], Thanks for updating your patch. For an e2e test, I think we can do it this way: refer to the tests in TestRMRestart. Using MockRM/MockAM we can do such a test; even though it's not a complete e2e test, most of the logic is covered by it. I suggest we cover the following cases: {code} * Create an app; before the AM is submitted, resource utilization should be 0 * Submit the AM; while the AM is running, we can get its resource utilization > 0 * Allocate some containers, finish them, and check total resource utilization * Finish the application attempt, and check total resource utilization * Start a new application attempt, and check whether the resource utilization of the previous attempt is added to the total resource utilization. * Check whether resource utilization can be persisted/read across RM restart {code} Do you have any comments on this? Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application.
The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
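The MB-seconds accumulation described above is just a sum of reserved-memory times lifetime products; a minimal sketch (class and method names are mine, not the patch's):

```java
// Minimal sketch of the chargeback metric described above:
// MB-seconds = sum over containers of (reserved MB * container lifetime in seconds).
// Names are illustrative, not from the YARN-415 patch.
public class ChargebackSketch {

  public static long mbSeconds(long[] reservedMb, long[] lifetimeSec) {
    long total = 0;
    for (int i = 0; i < reservedMb.length; i++) {
      // Charge for reserved memory, not used memory: nobody else could use it.
      total += reservedMb[i] * lifetimeSec[i];
    }
    return total;
  }
}
```

For example, two containers reserving 1024 MB for 60 s and 2048 MB for 30 s both contribute 61440 MB-seconds, for a total of 122880.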
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077296#comment-14077296 ] Wangda Tan commented on YARN-1707: -- Hi [~curino], Thanks for your reply. Regarding how the patch matches the JIRA: since I don't have other solid use cases in mind where anything besides the {{ReservationSystem}} could leverage these features, I don't have a strong opinion on merging such dynamic behaviors into {{ParentQueue}}/{{LeafQueue}}. Let's wait for more feedback. I agree that we can consider queue capacity as a weight; it will be easier for users to configure, and it's also a backward-compatible change (except it will not throw an exception when the sum of the children of a {{ParentQueue}} doesn't equal 100). bq. As I was mentioning in my previous comment, this is likely fine for the limited usage we will make of this from ReservationSystem I think moving applications across queues is not a ReservationSystem-specific change. I would suggest checking that the move will not violate restrictions in the target queue before performing it. Thanks, Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051.
Concretely this require the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity())= 100% instead of ==100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
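The "capacity as a weight" idea from the comment above can be sketched as follows; this is a hypothetical illustration, not actual CapacityScheduler code. Children's configured capacities are treated as weights and normalized, so their sum need not be exactly 100:

```java
// Hypothetical sketch of treating child-queue capacities as weights,
// as discussed in the comment above. Not actual CapacityScheduler code.
public class WeightSketch {

  /**
   * Normalize child "capacities" as weights: each child's share is its weight
   * divided by the sum of all weights, so any positive sum is accepted
   * (backward compatible: weights summing to 100 behave exactly as before).
   */
  public static double[] normalizedShares(double[] weights) {
    double sum = 0;
    for (double w : weights) {
      sum += w;
    }
    double[] shares = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      shares[i] = sum == 0 ? 0 : weights[i] / sum;
    }
    return shares;
  }
}
```

For instance, three children configured as 60, 60, and 80 would get shares of 0.3, 0.3, and 0.4 even though the configured values sum to 200.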
[jira] [Updated] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2215: - Assignee: Kenji Kikushima Add preemption info to REST/CLI --- Key: YARN-2215 URL: https://issues.apache.org/jira/browse/YARN-2215 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Kenji Kikushima Attachments: YARN-2215.patch As discussed in YARN-2181, we'd better to add preemption info to RM RESTful API/CLI to make administrator/user get more understanding about preemption happened on app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077303#comment-14077303 ] Wangda Tan commented on YARN-2215: -- Hi [~kj-ki], Thanks for working on this; I've assigned this JIRA to you. I think the fields you added should be fine. Within the scope of this JIRA, I think it's better to add CLI support as well. Please submit the patch to kick off Jenkins when you've completed it. Wangda Add preemption info to REST/CLI --- Key: YARN-2215 URL: https://issues.apache.org/jira/browse/YARN-2215 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Kenji Kikushima Attachments: YARN-2215.patch As discussed in YARN-2181, we'd better add preemption info to the RM RESTful API/CLI so that administrators/users get a better understanding of preemption happening on an app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080388#comment-14080388 ] Wangda Tan commented on YARN-2008: -- Hi [~cwelch], Thanks for uploading patch, +1 for putting isInvalidDivisor to {{ResourceCalculator}}. I would suggest to add some resource usage to L2Q1 in {{testAbsoluteMaxAvailCapacityWithUse}}, and see if L2Q2 can get correct maxAbsoluteAvailableCapacity. Thanks, Wangda CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081718#comment-14081718 ] Wangda Tan commented on YARN-2008: -- Hi [~cwelch], I found the patch you updated is identical with *.3.patch, could you please check? Thanks CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081844#comment-14081844 ] Wangda Tan commented on YARN-2008: -- Hi [~cwelch], Thanks for updating, now tests can cover all cases I can think about, A very minor comment: Could you please add a small ε for all {{assertEquals}} like following? bq. +assertEquals( 0.1f, result, 0.01f); Thanks, Wangda CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
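The suggestion above, passing a delta to {{assertEquals}}, guards against accumulated floating-point error; JUnit's three-argument {{assertEquals(expected, actual, delta)}} performs the equivalent comparison. A minimal illustration of why exact float equality is fragile (the helper name is mine):

```java
// Illustration of comparing floats with a small epsilon, equivalent to
// JUnit's assertEquals(expected, actual, delta). Helper name is illustrative.
public class EpsilonSketch {

  public static boolean nearlyEqual(float expected, float actual, float delta) {
    return Math.abs(expected - actual) <= delta;
  }
}
```

Computed capacities such as 80% * 50% go through repeated float multiplications and divisions, so the result may differ from the literal expectation by a few ulps; a delta of 0.01f comfortably absorbs that while still catching real errors.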
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081845#comment-14081845 ] Wangda Tan commented on YARN-2069: -- Hi [~mayank_bansal], Thanks for uploading, reviewing it now. Wangda CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely share code with) YARN-2113. YARN-2113 focuses on making sure that even if queue has its guaranteed capacity, it's individual users are treated in-line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14084245#comment-14084245 ] Wangda Tan commented on YARN-2008: -- Thanks [~cwelch] for updating, LGTM, +1 Wangda CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure - Key: YARN-2008 URL: https://issues.apache.org/jira/browse/YARN-2008 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Chen He Assignee: Craig Welch Attachments: YARN-2008.1.patch, YARN-2008.2.patch, YARN-2008.3.patch, YARN-2008.4.patch, YARN-2008.5.patch, YARN-2008.6.patch, YARN-2008.7.patch If there are two queues, both allowed to use 100% of the actual resources in the cluster. Q1 and Q2 currently use 50% of actual cluster's resources and there is not actual space available. If we use current method to get headroom, CapacityScheduler thinks there are still available resources for users in Q1 but they have been used by Q2. If the CapacityScheduelr has a hierarchy queue structure, it may report incorrect queueMaxCap. Here is a example ||||rootQueue|| || | | / | \ | | L1ParentQueue1 | | L1ParentQueue2| | (allowed to use up 80% of its parent)| | (allowed to use 20% in minimum of its parent)| |/ | \ || | L2LeafQueue1 |L2LeafQueue2 | | |(50% of its parent) | (50% of its parent in minimum) | | When we calculate headroom of a user in L2LeafQueue2, current method will think L2LeafQueue2 can use 40% (80%*50%) of actual rootQueue resources. However, without checking L1ParentQueue1, we are not sure. It is possible that L1ParentQueue2 have used 40% of rootQueue resources right now. Actually, L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084279#comment-14084279 ] Wangda Tan commented on YARN-2069: -- Hi [~mayank_bansal], Thanks for your patience. I've just read through your new patch. After #1/#2, if more resources need to be preempted, AM containers will be preempted. Is that correct? Please let me know if I misread your approach. *I think we should discuss the scope of this JIRA first; I'm a little confused after thinking about it.* According to the description of this JIRA, we need to make sure (assume we have calculated {{target-user-limit}} already): *REQ #1:* When considering preempting a container from user-x whose {{used-resource - marked-preempted-resource}} is already <= {{target-user-limit}}, we need to make sure no other user in the queue has {{used-resource - marked-preempted-resource}} > {{target-user-limit}}. *REQ #2:* When we have to preempt an AM container, we need to ensure #1 too. And as commented by [~vinodkv]: https://issues.apache.org/jira/browse/YARN-2069?focusedCommentId=14064047&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14064047. *REQ #3:* Users' resources after preemption should be as balanced as possible around {{target-user-limit}}. Do you agree with these requirements? I think we should update the JIRA description with the requirements once we decide. *My understanding of your new patch is that it consists of two phases:* 1. {{distributePreemptionforUsers}} will do preemption to enforce {{target-user-limit}} for each user. 2. If more resources need to be preempted, it will call {{distributePreemptionforUsers}} to make sure {{resToObtain}} is distributed as {{resToObtain}} divided by {{#active-user}} in the queue. I think phase-1 can enforce REQ#1, but phase-2 cannot enforce REQ#3. Also, REQ#2 cannot be satisfied by the patch.
Let me give you an example of why REQ#3 is not satisfied, similar to Vinod's example: {code} Queue has guaranteed resource = 30%; now it uses 60%, and we want to shrink it down to 40%. Container sizes are equal, each 3% of the cluster. Now there are 5 apps in the queue, user-limit configured to 20%. So the expected resources are {8%, 8%, 8%, 8%, 8%}. Before preemption: {15%, 9%, 12%, 12%, 12%} It is possible that after preemption in your current approach we get: {15%, 6%, 6%, 6%, 6%} (total is 39%) {code} Sometimes we cannot get every user's resources exactly equal to {{target-user-limit}} because the container size may not divide {{target-user-limit}} evenly. But we can do better, as in the following example: {code} After preemption: {9%, 9%, 9%, 6%, 6%} (total is 39%) {code} The imbalance is caused by the accumulated bias I mentioned in my comment: https://issues.apache.org/jira/browse/YARN-2069?focusedCommentId=14074249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14074249 Thanks, Wangda CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely sharing code with) YARN-2113. YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
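One way to avoid the accumulated bias is to always take the next container from the currently most over-served user, which yields the balanced {9%, 9%, 9%, 6%, 6%} outcome rather than {15%, 6%, 6%, 6%, 6%}. A toy sketch of that strategy (units are percent of the cluster, as in the example above; this is not the actual ProportionalCapacityPreemptionPolicy code):

```java
// Toy sketch of bias-free preemption: repeatedly preempt one container from
// the user with the highest remaining usage until enough is reclaimed.
// Units are percent of cluster, as in the example above; names are illustrative.
public class BalancedPreemptionSketch {

  public static int[] preemptBalanced(int[] usage, int containerSize, int toObtain) {
    int[] u = usage.clone();
    int removed = 0;
    while (removed < toObtain) {
      // Find the user with the highest remaining usage.
      int max = 0;
      for (int i = 1; i < u.length; i++) {
        if (u[i] > u[max]) {
          max = i;
        }
      }
      if (u[max] < containerSize) {
        break; // nothing left that we can preempt
      }
      u[max] -= containerSize; // preempt one container from the heaviest user
      removed += containerSize;
    }
    return u;
  }
}
```

Running this on the example usages {15, 9, 12, 12, 12} with container size 3 and 21 to reclaim produces {6, 6, 9, 9, 9}: the same multiset as the "better" {9%, 9%, 9%, 6%, 6%} outcome in the comment.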
[jira] [Commented] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087723#comment-14087723 ] Wangda Tan commented on YARN-2378: -- Hi [~subru], Thanks for uploading the patch; I took a look at it. As mentioned by [~vvasudev], there's another JIRA (YARN-2248) related to moving. I think the two JIRAs have different advantages, and I hope you can decide how to merge your work. - YARN-2378 covers RMApp-related changes, which should be done while moving - YARN-2248 covers more tests for queue-metrics. I think another major difference is that YARN-2248 will check queue capacity before moving and YARN-2378 does not. I had a discussion with [~curino] offline about this; here I paste what he said: {code} Imagine I have a busy cluster and want to migrate apps from queue A to queue B. Since we do not provide any transactional semantics from the CLI it would be quite hard to make sure I can move an app (even if I kill everything in queue B, and then invoke move A->B, more apps might show up and crowd the target queue B before I can successfully move). Having move be more sturdy and succeed right away, and enhancing preemption (if needed) to repair invariants, seems a better option in this scenario. I think preemption would already enforce max capacity; other active JIRAs should deal with user-limit as well. More generally I think eventually preemption can be our universal rebalancer/enforcer, allowing us to play a bit more fast and loose with move/resizing of queues. {code} I agree with this. Another example is that when refreshing queue capacities, some queues may be shrunk to lower than their guaranteed/used resources. We will not stop such a queue refresh, and preemption will also take care of this.
Some other comments about YARN-2378: 1) I think we should implement the state-store update in the move transition: {code} // TODO: Write out change to state store (YARN-1558) // Also take care of RM failover moveEvent.getResult().set(null); {code} 2) There are lots of test failures; I'm afraid the patch broke some major logic, could you please check? I will include a test review in the next iteration. Thanks, Wangda Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler Attachments: YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 into smaller patches for manageability. This JIRA will address adding support for moving apps between queues in the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088912#comment-14088912 ] Wangda Tan commented on YARN-2248: -- [~keyki], I agree we should get move-app committed in 2.6.0. Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Fix For: 2.6.0 Attachments: YARN-2248-1.patch, YARN-2248-2.patch, YARN-2248-3.patch We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088965#comment-14088965 ] Wangda Tan commented on YARN-2249: -- Hi [~jianhe], Thanks for working on the patch. I've read it; several comments/questions: 1) I haven't followed the work-preserving restart discussions for a long time. How does the current RM handle this problem: after the RM restarts, it starts allocating resources while NMs report containers to recover, but there is no resource available in a node/queue? I remember we discussed this topic while you were working on YARN-1368: the RM will not allocate new resources for x seconds after restart so that NMs can reconnect and recover containers. If you chose that approach, we can cache outstanding container release requests until x seconds after restart have passed. Could you also elaborate why you use the NM liveness expiry time? Can we improve this? 2) It seems to me that using {code} +this.pendingRelease = +CacheBuilder.newBuilder().expireAfterWrite {code} is not good enough, because it will cache every release request from the AM. Actually, we only need to cache release requests for a period of time after the AM reconnects to the RM. After that time elapses, the release logic should behave as before. 3) I think we shouldn't {{logFailure}} when the rmContainer is not found in this case. IMHO, we should {{logFailure}} when the release request is removed from the cache instead. 4) We should notify the AM with a container-completed message when we decide not to recover a container. And we should cover this in a test as well. 5) Tests: can we wait for some state instead of {{Thread.sleep(3000);}}? 
Thanks, Wangda RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
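A minimal plain-Java sketch of the time-bounded cache idea from comment (2) above: remember AM release requests that arrive before a container is recovered, apply them when the NM reports the container, and expire them after a deadline. This is an illustration only; the actual patch uses Guava's {{CacheBuilder.expireAfterWrite}}, and the class and method names here are hypothetical:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a time-bounded cache of pending container releases.
public class PendingReleaseCache {
    private final Map<String, Long> pendingRelease = new ConcurrentHashMap<>();
    private final long ttlMs;

    public PendingReleaseCache(long ttlMs) {
        this.ttlMs = ttlMs;  // e.g. the NM expiry interval discussed above
    }

    // AM asked to release a container the scheduler does not know about yet.
    public void remember(String containerId, long nowMs) {
        pendingRelease.put(containerId, nowMs);
    }

    // Called when the NM reports the container during recovery; returns true
    // if a release was pending and should be applied now.
    public boolean releaseIfPending(String containerId) {
        return pendingRelease.remove(containerId) != null;
    }

    // Drop entries older than the TTL; after this, release logic behaves as before.
    public void expire(long nowMs) {
        Iterator<Map.Entry<String, Long>> it = pendingRelease.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() > ttlMs) {
                it.remove();
            }
        }
    }

    public int size() {
        return pendingRelease.size();
    }
}
```

Wiring this into AbstractYarnScheduler, keyed by real ContainerIds and driven by the configured expiry interval, is what the patch itself does.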
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089002#comment-14089002 ] Wangda Tan commented on YARN-415: - Hi Eric, Thanks for your hard work adding these e2e tests. bq. However, I had trouble setting up a test with more than one attempt for the same app. I think I covered the rest. I suggest referring to {{TestAMRestart#testAMRestartWithExistingContainers}} as an example. Please let me know if you still have problems setting up the multi-attempt test. Some minor suggestions: 1) {code} + private final static File TEMP_DIR = new File(System.getProperty( + "test.build.data", "/tmp"), "decommision"); {code} I think it isn't used by the test. 2) bq. +Assert.assertTrue(YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS > 1); We don't need this assert. 3) bq. +System.out.println("EEP 001"); It's better to remove such personal debug info. 4) {code} +conf.setInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, +YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {code} It's better to put such logic into {{setup}}. And please update your patch against trunk. Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost 
of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
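The chargeback metric described above (sum over containers of reserved memory times container lifetime) can be sketched in a few lines; this is an illustrative computation, not YARN code, and the class and method names are made up:

```java
// Sketch of the proposed MB-seconds chargeback metric:
// sum over containers of (reserved memory in MB) * (lifetime in seconds).
public class MemorySeconds {
    // reservedMb[i] and lifetimeSec[i] describe container i of the application.
    public static long mbSeconds(long[] reservedMb, long[] lifetimeSec) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            // Charge for reserved memory even if the app used less of it.
            total += reservedMb[i] * lifetimeSec[i];
        }
        return total;
    }
}
```

For example, two containers reserving 1024 MB for 60 s and 2048 MB for 30 s would both be charged 61440 MB-seconds, for a total of 122880.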
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090098#comment-14090098 ] Wangda Tan commented on YARN-807: - Hi [~sandyr], While reading a comment on YARN-2385: https://issues.apache.org/jira/browse/YARN-2385?focusedCommentId=14089936page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14089936. I found that the difference between the CapacityScheduler/FairScheduler getAppsInQueues behavior was introduced by this patch: FairScheduler will return all apps, while CapacityScheduler will only return active apps. Are there any special considerations behind this difference? Do you think it is fine to change CapacityScheduler's behavior to return active+pending apps? Hope to get your ideas about this :) Thanks, Wangda When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question which apps are in queue x can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090100#comment-14090100 ] Wangda Tan commented on YARN-2385: -- Hi Subru, I've commented on YARN-807, https://issues.apache.org/jira/browse/YARN-807?focusedCommentId=14090098page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14090098 about this. I hope we can get some suggestion from [~sandyr] as well. Thanks, Wangda Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090183#comment-14090183 ] Wangda Tan commented on YARN-807: - Hi [~sandyr], Thanks for your comment. If you think it's a bug, we can resolve it in YARN-2385. If the desired behavior of this patch is to have running/completed apps returned when querying by queue, we might need to check all RMApps in RMContext, because apps are removed from the scheduler after they complete. We may need to create a Map<queue-name, app-id> in RMContext. Do you think this is a doable approach? Thanks, Wangda When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question which apps are in queue x can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090250#comment-14090250 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], It's great to have so many cleanups in your new patch; it almost looks good to me. One minor comment: I found {{testUsageAfterAMRestartWithMultipleContainers}} and {{testUsageAfterAMRestartKeepContainers}} are very similar. Could you find a way to create a common test method for them, where the only difference is passing a boolean keepContainer as a parameter? Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090314#comment-14090314 ] Wangda Tan commented on YARN-2378: -- Hi [~subru], It's good to have more tests from YARN-2248; I think it now covers most cases. Several comments about the tests: 1) testMoveAppForMoveToQueueCannotRunApp: I think the name may not be precise enough. Actually, it means moving an app from a small queue (which cannot allocate more resource) to a larger queue. I suggest changing the name. And the comment is incorrect: {code} +// task_0_0 task_1_0 allocated, used=4G +nodeUpdate(nm_0); {code} The used resource should be 2G here. 2) testMoveAllApps: bq. +Thread.sleep(100); I think we don't need the sleep here; moveApplication is a synchronized call. Thanks, Wangda Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler Attachments: YARN-2378.patch, YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 to smaller patches for manageability. This JIRA will address adding support for moving apps between queues in Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
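The recurring review suggestion of waiting for a state instead of a fixed {{Thread.sleep}} can be sketched as a small polling helper. This is a generic illustration with hypothetical names, not the actual MockRM/test-utility API:

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: poll a condition with a short interval and an
// overall timeout, instead of sleeping for a fixed 3 seconds and hoping
// the state transition has happened by then.
public class WaitFor {
    public static boolean waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;  // timed out waiting for the state
            }
            try {
                Thread.sleep(10);  // short poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;  // condition reached, possibly much sooner than the timeout
    }
}
```

A test would then write something like `assert WaitFor.waitFor(() -> app.getState() == DESIRED_STATE, 10000);` (names hypothetical), which both fails fast on success and gives a clear timeout on failure.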
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090318#comment-14090318 ] Wangda Tan commented on YARN-2249: -- Hi [~jianhe], Thanks for the update; several minor comments: AbstractYarnScheduler.java 1. {code} + private final Object object = new Object(); {code} Please change this to a more meaningful name. 2. {code} +}, yarnConf.getInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, + YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS)); {code} I found this is used several times; it's better to make it a member of AYS. Thanks, Wangda RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090440#comment-14090440 ] Wangda Tan commented on YARN-807: - bq. It's also worth considering only holding this map for completed applications, so we don't need to keep two maps for running applications. I suggest we do it this way: 1) Rename the scheduler-side getAppsInQueue to getRunningAppsInQueue 2) Create a Map<Queue-name, Set<App-ID>> in RMContext; it will contain completed/running apps. The benefit of storing them separately is that we don't need to query two places when a client wants to get applications. And getRunningAppsInQueue on the scheduler side will be used when we need to query running apps in a queue, as in YARN-2378. Thanks, Wangda When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question which apps are in queue x can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
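The two-map proposal in that comment can be sketched in plain Java. {{QueueAppIndex}} and its methods are hypothetical stand-ins for the scheduler-side bookkeeping (running apps) and the RMContext-side bookkeeping (completed apps); the real implementation would use YARN's ApplicationId and queue objects:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: running apps tracked per queue (scheduler side),
// completed apps tracked per queue (RMContext side), merged on query.
public class QueueAppIndex {
    private final Map<String, Set<String>> runningByQueue = new HashMap<>();
    private final Map<String, Set<String>> completedByQueue = new HashMap<>();

    public void appStarted(String queue, String appId) {
        runningByQueue.computeIfAbsent(queue, q -> new HashSet<>()).add(appId);
    }

    public void appCompleted(String queue, String appId) {
        Set<String> running = runningByQueue.get(queue);
        if (running != null) {
            running.remove(appId);
        }
        completedByQueue.computeIfAbsent(queue, q -> new HashSet<>()).add(appId);
    }

    // Equivalent of the proposed scheduler-side getRunningAppsInQueue.
    public Set<String> getRunningAppsInQueue(String queue) {
        return runningByQueue.getOrDefault(queue, Collections.emptySet());
    }

    // Client-facing query: running plus completed apps in one answer,
    // without scanning every RMApp.
    public Set<String> getAppsInQueue(String queue) {
        Set<String> all = new HashSet<>(getRunningAppsInQueue(queue));
        all.addAll(completedByQueue.getOrDefault(queue, Collections.emptySet()));
        return all;
    }
}
```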
[jira] [Commented] (YARN-807) When querying apps by queue, iterating over all apps is inefficient and limiting
[ https://issues.apache.org/jira/browse/YARN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090455#comment-14090455 ] Wangda Tan commented on YARN-807: - Hi Sandy, Thanks for your elaboration. As you said, I agree we need to go through the scheduler, given the two capabilities you mentioned. A possible way is saving completed apps in the leaf queue, as you mentioned. I remember YARN will evict some apps when the total number of apps exceeds a configured number (like 10,000); we should do such eviction for completed apps in the leaf queue as well. When querying apps by queue, iterating over all apps is inefficient and limiting - Key: YARN-807 URL: https://issues.apache.org/jira/browse/YARN-807 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: YARN-807-1.patch, YARN-807-2.patch, YARN-807-3.patch, YARN-807-4.patch, YARN-807.patch The question which apps are in queue x can be asked via the RM REST APIs, through the ClientRMService, and through the command line. In all these cases, the question is answered by scanning through every RMApp and filtering by the app's queue name. All schedulers maintain a mapping of queues to applications. I think it would make more sense to ask the schedulers which applications are in a given queue. This is what was done in MR1. This would also have the advantage of allowing a parent queue to return all the applications on leaf queues under it, and allow queue name aliases, as in the way that root.default and default refer to the same queue in the fair scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
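The eviction idea discussed there (drop the oldest completed apps once a configured limit is reached) can be sketched with a size-bounded {{LinkedHashMap}}. The class name and the per-queue wiring are hypothetical; the real RM enforces a configurable max-completed-applications limit:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: retain at most N completed apps (app id -> final
// state), evicting the oldest entry once the configured limit is exceeded.
public class CompletedAppStore extends LinkedHashMap<String, String> {
    private final int maxCompletedApps;

    public CompletedAppStore(int maxCompletedApps) {
        super(16, 0.75f, false);  // insertion order: oldest completed first
        this.maxCompletedApps = maxCompletedApps;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        // LinkedHashMap calls this after each put; returning true evicts
        // the eldest entry, mimicking the configured cap (e.g. 10,000).
        return size() > maxCompletedApps;
    }
}
```

The same pattern could back a per-leaf-queue store of completed apps, keeping queue-level queries cheap while bounding memory.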
[jira] [Commented] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091658#comment-14091658 ] Wangda Tan commented on YARN-2249: -- Jian, Thanks for update, My last comment is, Could you rename {{mutex}} to {{pendingReleaseMutex}} or something? Wangda RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092301#comment-14092301 ] Wangda Tan commented on YARN-2308: -- [~lichangleo], Thanks for working on this. I took a quick scan at your patch; I think the general approach should be fine. Some minor suggestions: 1) {code} +if (application==null) { + LOG.info("can't retireve application attempt"); + return; +} {code} Please leave a space before and after {{==}}, and use LOG.error instead of LOG.info. 2) Test code 2.1 bq. +System.out.println("testing queue change!!!"); Please remove this. 2.2 {code} +conf.setBoolean(CapacitySchedulerConfiguration.ENABLE_USER_METRICS, true); +conf.set(CapacitySchedulerConfiguration.RESOURCE_CALCULATOR_CLASS, {code} We may not need this either. 2.3 {code} +// clear queue metrics +rm1.clearQueueMetrics(app1); {code} Also this. 2.4 It's better to wait for and check the app state transition to Failed after it is rejected. 2.5 I think this test isn't a work-preserving-restart-specific problem; it's better to place the test in TestRMRestart. Please let me know if you have any comments on them. 
Thanks, Wangda NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. 
This is caused by a queue configuration change: I removed some queues and added new ones. So when the RM restarts, it tries to recover historical applications, and when the queue of any of these applications has been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092303#comment-14092303 ] Wangda Tan commented on YARN-415: - [~eepayne], bq. I created a common method that both of these call. Thanks! bq. I also noticed that testUsageWithMultipleContainers was doing similar things to testUsageAfterRMRestart, so I combined them both into testUsageWithMultipleContainersAndRMRestart. Good catch, I don't have further comments, but would you please check test failure above? Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093695#comment-14093695 ] Wangda Tan commented on YARN-2378: -- [~subru], I ran the previously failed test locally and it passed, the same as the latest Jenkins result. LGTM, +1. [~jianhe], would you like to take a look at this? Thanks, Wangda Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Subramaniam Venkatraman Krishnan Labels: capacity-scheduler Attachments: YARN-2378.patch, YARN-2378.patch, YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 to smaller patches for manageability. This JIRA will address adding support for moving apps between queues in Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093696#comment-14093696 ] Wangda Tan commented on YARN-415: - [~jianhe], would you like to take a look at it? Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). 
This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093714#comment-14093714 ] Wangda Tan commented on YARN-2308: -- [~lichangleo], Thanks for the update. I think the following line is not necessary: bq. +conf.setBoolean(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_ENABLED, true); I just tried locally; removing it should be fine. Besides this, LGTM, +1. [~zjshen], could you take a look at this? Thanks, Wangda NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) 
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2414) RM web UI: app page will crash if app is failed before any attempt has been created
[ https://issues.apache.org/jira/browse/YARN-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095039#comment-14095039 ] Wangda Tan commented on YARN-2414: -- Assigned it to myself, will post a patch soon. RM web UI: app page will crash if app is failed before any attempt has been created --- Key: YARN-2414 URL: https://issues.apache.org/jira/browse/YARN-2414 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Zhijie Shen Assignee: Wangda Tan
{code}
2014-08-12 16:45:13,573 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/app/application_1407887030038_0001
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:460)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1191)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:116)
{code}
[jira] [Assigned] (YARN-2414) RM web UI: app page will crash if app is failed before any attempt has been created
[ https://issues.apache.org/jira/browse/YARN-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2414: Assignee: Wangda Tan RM web UI: app page will crash if app is failed before any attempt has been created --- Key: YARN-2414 URL: https://issues.apache.org/jira/browse/YARN-2414 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Zhijie Shen Assignee: Wangda Tan (The issue description and stack trace are the same as in the previous message; the NPE originates at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:116).)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095050#comment-14095050 ] Wangda Tan commented on YARN-2308: -- bq. I think we should catch exception in following code and return Failed directly. Currently the CapacityScheduler creates an AppRejectedEvent when it finds the queue does not exist while recovering or submitting:
{code}
if (queue == null) {
  String message = "Application " + applicationId + " submitted by user " + user
      + " to unknown queue: " + queueName;
  this.rmContext.getDispatcher().getEventHandler()
      .handle(new RMAppRejectedEvent(applicationId, message));
  return;
}
{code}
We cannot catch exception here, because now exception throw:
{code}
// Add application to scheduler synchronously to guarantee scheduler
// knows applications before AM or NM re-registers.
app.scheduler.handle(new AppAddedSchedulerEvent(app.applicationId,
    app.submissionContext.getQueue(), app.user, true));
{code}
bq. That's what I meant. RMApp can choose to enter FAILED state directly and no need to add attempt any more. It will not add an attempt here, because it will get rejected directly. bq. RM_WORK_PRESERVING_RECOVERY_ENABLED=true reflects the failure case in the description, but I'm wondering why RM_WORK_PRESERVING_RECOVERY_ENABLED=false, the test is going to fail. App will anyway be rejected, won't it? I've tried this locally again, and it passes. Setting RM_WORK_PRESERVING_RECOVERY_ENABLED=false is enough to cover what we want to verify.
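The discussion above is about rejecting an application up front when its queue has been removed, rather than letting a later APP_ATTEMPT_ADDED event hit a null queue. A minimal, hypothetical sketch of such a guard (this is not the actual YARN-2308 patch; the class and method names are invented for illustration):

```java
import java.util.Set;

// Hypothetical sketch: before recovering an application, check that its queue
// still exists; if the queue was removed by a configuration change, produce a
// rejection message instead of letting the scheduler NPE later.
public class QueueRecoveryCheck {
    /** Returns null when recovery is safe, or a rejection message otherwise. */
    public static String validateOnRecovery(Set<String> existingQueues,
                                            String applicationId, String queueName) {
        if (!existingQueues.contains(queueName)) {
            return "Application " + applicationId
                + " submitted to unknown queue: " + queueName;
        }
        return null; // queue still exists; attempts can be added safely
    }
}
```

The point of the check is only ordering: it runs before any attempt is created, so the reject path is taken instead of the NPE path.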
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095055#comment-14095055 ] Wangda Tan commented on YARN-2308: -- Typo: bq. We cannot catch exception here, because now exception throw: Should be: We cannot catch exception here, because *no* exception throw:
[jira] [Commented] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096329#comment-14096329 ] Wangda Tan commented on YARN-2385: -- After thinking about it, I think we might not need to maintain completed apps in CS and Fair. Maintaining such fields is not a responsibility the scheduler was originally designed for. And for now, users can get completed containers via the REST API, which should cover most use cases. Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Venkatraman Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from a source queue to a target queue and for draining a queue, i.e. killing all applications in a queue, as proposed by YARN-2389.
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100316#comment-14100316 ] Wangda Tan commented on YARN-2411: -- Ram, Thanks for updating, LGTM, +1. Wangda [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. 
and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (the default is empty, which is the same as no mapping):
{code}
map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]*
map_specifier    := user (u) | group (g)
source_attribute := user | group | %user
queue_name       := the name of the mapped queue | %user | %primary_group
{code}
The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permission to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2: u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up.
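The left-to-right, first-match semantics proposed above can be sketched in a few lines. This is a hypothetical illustration of the proposed rules only, not the YARN-2411 patch; the class and method names are invented:

```java
import java.util.Set;

// Hypothetical sketch of the proposed queue-mapping semantics: evaluate
// "u:source:queue" / "g:source:queue" entries left to right and return the
// queue from the first mapping that applies to the submitting user.
public class QueueMappingSketch {
    public static String resolveQueue(String mappings, String user, Set<String> groups,
                                      String defaultQueue) {
        if (mappings == null || mappings.isEmpty()) {
            return defaultQueue; // empty mapping list: no override
        }
        for (String entry : mappings.split(",")) {
            String[] parts = entry.trim().split(":");
            String specifier = parts[0], source = parts[1], queue = parts[2];
            if (specifier.equals("u") && (source.equals("%user") || source.equals(user))) {
                // %user as the queue name maps the user to a queue of the same name
                return queue.equals("%user") ? user : queue;
            }
            if (specifier.equals("g") && groups.contains(source)) {
                return queue;
            }
        }
        return defaultQueue;
    }
}
```

For example, with the proposal's first sample mapping, user1 resolves to queue1, and any member of group1 (who is not user1) resolves to queue2.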
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.demo.patch.1 Hi guys, Thanks for your input over the past several weeks. I implemented a patch based on the design doc https://issues.apache.org/jira/secure/attachment/12662291/Node-labels-Requirements-Design-doc-V2.pdf during the past two weeks. I'd really appreciate it if you could take a look. The patch is YARN-796.node-label.demo.patch.1 (I made a longer name so it won't be confused with other patches). *Already included in this patch:* * Protocol changes for ResourceRequest, ApplicationSubmissionContext (leveraging contributions from Yuliya's patch, thanks), plus an updated AMRMClient * RMAdmin changes to dynamically update labels of a node (add/set/remove), plus an updated RMAdmin CLI * Capacity scheduler related changes, including: ** headroom calculation, preemption, and container allocation respecting labels ** allowing users to set the list of labels a queue can access in capacity-scheduler.xml * A centralized node label manager that can be updated dynamically to add/set/remove labels, and can store labels to the file system. It works with the RM restart/HA scenario (similar to RMStateStore). * Support for the {{--labels}} option in distributed shell, so we can use distributed shell to test this feature * Related unit tests *Will include later:* * RM REST APIs for node labels * Distributed configuration (set labels in yarn-site.xml of NMs) * Support for labels in FairScheduler *Try this patch* 1.
Create a capacity-scheduler.xml with labels accessible on queues:
{code}
        root
       /    \
      a      b
      |      |
      a1     b1

a.capacity = 50, b.capacity = 50
a1.capacity = 100, b1.capacity = 100
And a.label = red,blue; b.label = blue,green:

<property>
  <name>yarn.scheduler.capacity.root.a.labels</name>
  <value>red, blue</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.labels</name>
  <value>blue, green</value>
</property>
{code}
This means queue a (and its sub-queues) CAN access labels red and blue; queue b (and its sub-queues) CAN access labels blue and green. 2. Create a node-labels.json locally; this is the initial set of labels on nodes (you can dynamically change it using the rmadmin CLI while the RM is running, so you don't have to do it this way). Set {{yarn.resourcemanager.labels.node-to-label-json.path}} to {{file:///path/to/node-labels.json}}:
{code}
{
  "host1": { "labels": ["red", "blue"] },
  "host2": { "labels": ["blue", "green"] }
}
{code}
This sets red/blue labels on host1 and blue/green labels on host2. 3. Start the YARN cluster (if you have several nodes in the cluster, you need to launch HDFS to use distributed shell). * Submit a distributed shell job:
{code}
hadoop jar path/to/*distributedshell*.jar org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command hostname -jar path/to/*distributedshell*.jar -num_containers 10 -labels red blue -queue a1
{code}
This runs a distributed shell job that launches 10 containers, each running hostname; the requested labels are red blue, so all containers will be allocated on host1. Some other examples: * {{-queue a1 -labels red green}}: this will be rejected, because queue a1 cannot access label green * {{-queue a1 -labels blue}}: some containers will be allocated on host1 and some on host2, because both hosts carry the blue label * {{-queue b1 -labels green}}: all containers will be allocated on host2 4.
Dynamically update labels using the rmadmin CLI:
{code}
// dynamically add labels x, y to the label manager
yarn rmadmin -addLabels x,y
// dynamically set label x on node1, and labels x,y on node2
yarn rmadmin -setNodeToLabels node1:x;node2:x,y
// remove labels from the label manager, and also remove them from nodes
yarn rmadmin -removeLabels x
{code}
*Two more examples for node labels*
1. Labels as constraints:
{code}
Queue structure:
        root
       / | \
      a  b  c

a has labels: WINDOWS, LINUX, GPU
b has labels: WINDOWS, LINUX, LARGE_MEM
c doesn't have labels

25 nodes in the cluster:
h1-h5:   LINUX, GPU
h6-h10:  LINUX
h11-h15: LARGE_MEM, LINUX
h16-h20: LARGE_MEM, WINDOWS
h21-h25: (empty)
{code}
If you want LINUX GPU resources, you should submit to queue a and set the label in the Resource Request to LINUX GPU. If you want LARGE_MEM resources and don't mind the OS, you can submit to queue b and set the label in the Resource Request to LARGE_MEM. If you want to allocate on nodes that don't have labels (h21-h25), you can submit to any queue and leave the label in the Resource Request empty.
2. Labels to hard-partition the cluster:
{code}
Queue structure:
        root
       / | \
      a  b  c

a has label: MARKETING
b has label: HR
c has label: RD

15 nodes in the cluster:
h1-h5:   MARKETING
h6-h10:  HR
h11-h15: RD
{code}
Now the cluster is hard-partitioned into 3 small clusters: h1-h5 is for marketing, only queue a can use it, and you should set the label in the Resource Request accordingly. Similar for the HR/RD clusters. I appreciate your
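The allocation behavior walked through above combines two checks: the queue must be allowed to access every requested label, and a container can only land on hosts whose label set contains every requested label. A toy model of those rules (illustrative only, not code from the patch; all names are invented):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the placement rules described above: a request may
// only use labels its queue can access, and a container may be placed only on
// hosts whose labels contain every requested label.
public class NodeLabelSketch {
    public static List<String> candidateHosts(Set<String> queueLabels,
                                              Set<String> requestLabels,
                                              Map<String, Set<String>> hostLabels) {
        if (!queueLabels.containsAll(requestLabels)) {
            return List.of(); // rejected: queue cannot access a requested label
        }
        return hostLabels.entrySet().stream()
            .filter(e -> e.getValue().containsAll(requestLabels))
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

With host1 = {red, blue} and host2 = {blue, green}, this reproduces the worked examples: a red+blue request from a queue that can access red/blue matches only host1, a blue-only request matches both hosts, and a green request from that same queue is rejected.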
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104926#comment-14104926 ] Wangda Tan commented on YARN-796: - bq. As I've said before, I basically want something similar to the health check code: I provide something executable that the NM can run at runtime that will provide the list of labels. If we need to add labels, it's updating the script which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere. I understand now; it makes sense since it's a flexible way for admins to set labels on the NM side. Adding a {{NodeLabelCheckerService}} to the NM, similar to {{NodeHealthCheckerService}}, should work. I'll create a separate JIRA for setting labels on the NM side under this ticket and leave the design/implementation discussion here. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104931#comment-14104931 ] Wangda Tan commented on YARN-2056: -- Maybe another way to do this is to add a per-queue config, like {{..queue-path.disable_preemption}}. Then, in {{ProportionalCapacityPreemptionPolicy#cloneQueues}}, if a queue's used capacity is more than its guaranteed resource and it has preemption disabled, we will not create a TempQueue for it. This will not require an RM restart when a queue property changes (queue properties will be refreshed, and the PreemptionPolicy will pick up such changes). Does it make sense? Thanks, Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt We need to be able to disable preemption at individual queue level.
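As a sketch, the per-queue switch proposed in the comment above might look like this in capacity-scheduler.xml. The property name follows the pattern suggested in the comment and is hypothetical, not a shipped configuration key:

{code}
<property>
  <!-- Hypothetical per-queue switch from the proposal above -->
  <name>yarn.scheduler.capacity.root.myqueue.disable_preemption</name>
  <value>true</value>
</property>
{code}

Because capacity-scheduler.xml can be reloaded via yarn rmadmin -refreshQueues, a property like this could take effect without an RM restart, which is the point of the proposal.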
[jira] [Updated] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2434: - Summary: RM should not recover containers from previously failed attempt when AM restart is not enabled (was: RM should not recover containers from previously failed attempt) RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt.
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105116#comment-14105116 ] Wangda Tan commented on YARN-2434: -- Jian, thanks for the patch, LGTM, +1.
[jira] [Commented] (YARN-2433) Stale token used by restarted AM (with previous containers retained) to request new container
[ https://issues.apache.org/jira/browse/YARN-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105119#comment-14105119 ] Wangda Tan commented on YARN-2433: -- [~yingdachen], thanks for reporting this issue. I can take a look at it and will keep you posted. Thanks, Wangda Stale token used by restarted AM (with previous containers retained) to request new container - Key: YARN-2433 URL: https://issues.apache.org/jira/browse/YARN-2433 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.4.1 Reporter: Yingda Chen Assignee: Wangda Tan With Hadoop 2.4, container retention is supported across an AM crash-and-restart. However, after an AM is restarted with containers retained, it appears to use the stale token to start new containers, which leads to the error below. To truly support container retention, the AM should be able to communicate with previous container(s) with the old token and ask for new containers with the new token. This could be similar to YARN-1321, which was reported and fixed earlier. ERROR: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 STACK trace: {code} hadoop.ipc.ProtobufRpcEngine$Invoker.invoke org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103: Response - YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454: startContainers {services_meta_data { key: mapreduce_shuffle value: \000\0004\372 } failed_requests { container_id { app_attempt_id { application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2 } exception { message: Unauthorized request to start container.
\nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02 trace: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. \nNMToken for application attempt : appattempt_1408130608672_0065_01 was used for starting container with container token issued for application attempt : appattempt_1408130608672_0065_02\r\n\tat org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:48)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeStartRequest(ContainerManagerImpl.java:508)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:571)\r\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:538)\r\n\tat org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)\r\n\tat org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)\r\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\r\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)\r\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)\r\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)\r\n\tat java.security.AccessController.doPrivileged(Native Method)\r\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\r\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)\r\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)\r\n class_name: org.apache.hadoop.yarn.exceptions.YarnException } }} {code} -- This message was sent by Atlassian JIRA 
(v6.2#6252)
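The fix direction described above (old token for retained containers, new token for new containers) can be sketched as a per-attempt token cache. This is a minimal, hypothetical illustration; the class and method names are invented and are not the actual YARN `NMClient` API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keep one NM token per application attempt, so a
// restarted AM can talk to retained containers with the old attempt's
// token while starting new containers with the new attempt's token,
// instead of always using the latest token (the bug reported above).
class AttemptTokenCache {
    private final Map<String, String> tokenByAttempt = new HashMap<>();

    // Remember the token issued for a given application attempt.
    public void setToken(String attemptId, String token) {
        tokenByAttempt.put(attemptId, token);
    }

    // Select the token matching the attempt that issued the container
    // token, rather than unconditionally using the current attempt's.
    public String tokenFor(String containerTokenAttemptId) {
        return tokenByAttempt.get(containerTokenAttemptId);
    }
}
```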
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106669#comment-14106669 ] Wangda Tan commented on YARN-2345: -- Hi Hao, I think we already have a NodeCLI, which is yarn node -status nodeid as you said. We don't need to add such a method to the RM admin CLI; the RM admin CLI should only implement methods contained in ResourceManagerAdministrationProtocol. I would suggest adding more information to the output of yarn node -all -list, like memory-used, CPU-used, etc., just like the RM web UI's Nodes page. Thanks, Wangda yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108645#comment-14108645 ] Wangda Tan commented on YARN-2345: -- [~aw], I agree with you; the user doesn't need to understand what happens inside. How about marking the yarn node CLI deprecated and adding its existing functions to the rmadmin CLI? yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2385) Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108646#comment-14108646 ] Wangda Tan commented on YARN-2385: -- Splitting this into two APIs makes sense to me. It's more flexible and accurate. Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler Currently getAppsinQueue returns both pending and running apps. The purpose of this JIRA is to explore splitting it into getRunningAppsInQueue + getPendingAppsInQueue, which will provide more flexibility to callers -- This message was sent by Atlassian JIRA (v6.2#6252)
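The proposed split can be sketched as two filtered views over one app list. This is a minimal illustration with invented class names, not the actual scheduler API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the YARN-2385 proposal: instead of one
// getAppsInQueue returning pending and running apps mixed together,
// expose two filtered accessors so callers get exactly what they need.
class QueueApps {
    enum State { PENDING, RUNNING }

    static class App {
        final String id;
        final State state;
        App(String id, State state) { this.id = id; this.state = state; }
    }

    private final List<App> apps = new ArrayList<>();

    void add(App a) { apps.add(a); }

    List<App> getRunningAppsInQueue() { return filter(State.RUNNING); }

    List<App> getPendingAppsInQueue() { return filter(State.PENDING); }

    private List<App> filter(State s) {
        List<App> out = new ArrayList<>();
        for (App a : apps) {
            if (a.state == s) out.add(a);
        }
        return out;
    }
}
```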
[jira] [Commented] (YARN-2448) RM should expose the name of the ResourceCalculator being used when AMs register
[ https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108822#comment-14108822 ] Wangda Tan commented on YARN-2448: -- [~vvasudev], Thanks for working on the patch, it looks good to me, +1 Wangda RM should expose the name of the ResourceCalculator being used when AMs register Key: YARN-2448 URL: https://issues.apache.org/jira/browse/YARN-2448 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch The RM should expose the name of the ResourceCalculator being used when AMs register, as part of the RegisterApplicationMasterResponse. This will allow applications to make better decisions when scheduling. MapReduce, for example, only looks at memory in its scheduling decisions, even though the RM could potentially be using the DominantResourceCalculator. -- This message was sent by Atlassian JIRA (v6.2#6252)
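The motivation above can be made concrete with a small sketch of why the calculator name matters: a memory-only view and a dominant-resource view can assign very different weights to the same request. This is an illustrative toy, not YARN's actual ResourceCalculator implementations:

```java
// Hypothetical sketch: the share of a cluster consumed by one request,
// computed the way a memory-only calculator sees it versus the way a
// dominant-resource calculator sees it. An AM that only reasons about
// memory can badly underestimate a vcore-heavy request.
class CalculatorDemo {
    // Memory-only view: fraction of cluster memory requested.
    static double memoryShare(int mem, int clusterMem) {
        return (double) mem / clusterMem;
    }

    // Dominant-resource view: the larger of the memory and vcore shares.
    static double dominantShare(int mem, int vcores,
                                int clusterMem, int clusterVcores) {
        return Math.max((double) mem / clusterMem,
                        (double) vcores / clusterVcores);
    }
}
```

For a request of 10 GB and 50 vcores on a 100 GB / 100 vcore cluster, the memory-only share is 0.1 while the dominant share is 0.5, which is why exposing the calculator in RegisterApplicationMasterResponse helps AMs schedule sensibly.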
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108932#comment-14108932 ] Wangda Tan commented on YARN-1707: -- Hi [~curino], Thanks for updating. I just took a look; some minor comments: 1) CapacityScheduler#removeQueue {code} if (disposableLeafQueue.getCapacity() > 0) { throw new SchedulerConfigEditException("The queue " + queueName + " has non-zero capacity: " + disposableLeafQueue.getCapacity()); } {code} removeQueue checks that disposableLeafQueue's capacity > 0, but addQueue doesn't check. In addition, after the previous check, ParentQueue#removeChildQueue/addChildQueue doesn't need to check the capacity again. And they should throw the same type of exception (both SchedulerConfigEditException or both IllegalArgumentException). 2) CS#addQueue {code} throw new SchedulerConfigEditException("Queue " + queue.getQueueName() + " is not a dynamic Queue"); {code} Should "dynamic Queue" be "reservation queue", to match the similar exception thrown in removeQueue? 3) CS#setEntitlement {code} if (sesConf.getCapacity() > queue.getCapacity()) { newQueue.addCapacity((sesConf.getCapacity() - queue.getCapacity())); } else { newQueue.subtractCapacity((queue.getCapacity() - sesConf.getCapacity())); } {code} Maybe it's better to merge add/subtractCapacity into a changeCapacity(delta), or just create a setCapacity in ReservationQueue? 4) CS#getReservableQueues Is it better to rename it to getPlanQueues? 5) ReservationQueue#getQueueName {code} @Override public String getQueueName() { return this.getParent().getQueueName(); } {code} I'm not sure why this is done; could you please elaborate? This makes this.queueName and this.getQueueName() have different semantics. 6) ReservationQueue#subtractCapacity {code} this.setCapacity(this.getCapacity() - capacity); {code} With EPSILON, it is possible for this.capacity to go below 0 after subtract; it's better to cap this.capacity to the range [0, 1]. 
The same applies to addCapacity. 7) DynamicQueueConf I think unfolding it into two float parameters for setEntitlement may be more straightforward; is it possible that more fields will be added to DynamicQueueConf? 8) ParentQueue#setChildQueues Since only PlanQueue needs sum of capacity <= 1, I would suggest making this method protected so that PlanQueue can override it, or adding a check in ParentQueue#setChildQueues. Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
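The changeCapacity(delta) suggestion above, with the capacity capped to [0, 1] so EPSILON drift cannot push it negative, can be sketched as follows. The names are illustrative, not the actual patch code:

```java
// Hypothetical sketch of merging add/subtractCapacity into one
// changeCapacity(delta), clamping the result to [0, 1] so repeated
// floating-point adjustments can never leave capacity out of range.
class CappedCapacity {
    private float capacity;

    CappedCapacity(float initial) { capacity = clamp(initial); }

    // Positive delta grows the queue, negative delta shrinks it;
    // either way the result stays within [0, 1].
    void changeCapacity(float delta) {
        capacity = clamp(capacity + delta);
    }

    float getCapacity() { return capacity; }

    private static float clamp(float c) {
        return Math.max(0f, Math.min(1f, c));
    }
}
```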
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108991#comment-14108991 ] Wangda Tan commented on YARN-1198: -- Hi [~cwelch], Thanks for updating, I went through your patch just now. The current approach makes more sense to me compared to patch #4; it avoids iterating over all apps when computing headroom. But currently, CapacityHeadroomProvider#getHeadroom will recompute the headroom on each application heartbeat. Assuming we have #applications > #users in a queue (the most likely case), that is still a little costly. I agree more with the method mentioned by Jason: specifically, we can create a map of <user, headroom> for each queue; when we need to update headroom, we can update all the headrooms in the map, and each SchedulerApplicationAttempt will hold a reference to its headroom. The headroom in the map may be the same as the {{HeadroomProvider}} in your patch. I would suggest renaming {{HeadroomProvider}} to {{HeadroomReference}}, because we don't need to do any computation in it anymore. Another benefit is that we don't need to write a HeadroomProvider for each scheduler; a simple HeadroomReference with getter/setter should be enough. Two more things we should take care of with this method: 1) As mentioned by Jason, the fair/capacity schedulers both support moving apps between queues; we should recompute and change the reference after moving an app. 
2) In LeafQueue#assignContainers, we don't need to call {code} Resource userLimit = computeUserLimitAndSetHeadroom(application, clusterResource, required); {code} for each application; iterating and updating the map of <user, headroom> in LeafQueue#updateClusterResource should be enough. Wangda Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch Today the headroom calculation (for the app) takes place only when: * a new node is added/removed from the cluster * a new container is getting assigned to the application. However there are potentially a lot of situations which are not considered in this calculation: * If a container finishes then the headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly, if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when the admin refreshes the queue, headroom has to be updated. 
These are all potential bugs in the headroom calculation -- This message was sent by Atlassian JIRA (v6.2#6252)
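The per-user headroom map with shared references discussed in the comment above can be sketched like this. The names (HeadroomReference, referenceFor) are hypothetical, not the names in the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the <user, headroom> map idea: the queue keeps
// one mutable HeadroomReference per user, every application attempt of
// that user holds the same reference, and the queue updates it once per
// cluster-resource change rather than recomputing on every heartbeat.
class HeadroomDemo {
    static class HeadroomReference {
        private long headroom;
        synchronized void set(long h) { headroom = h; }
        synchronized long get() { return headroom; }
    }

    private final Map<String, HeadroomReference> byUser = new HashMap<>();

    // All attempts of one user share a single reference.
    HeadroomReference referenceFor(String user) {
        return byUser.computeIfAbsent(user, u -> new HeadroomReference());
    }

    // Called when cluster resources change; one write is visible to
    // every application of the user at once.
    void updateHeadroom(String user, long headroom) {
        referenceFor(user).set(headroom);
    }
}
```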
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113124#comment-14113124 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Really sorry for coming late, and thanks for working on this. I just took a look at your method and patch; some comments: 1) I prefer to make the per-queue disable-preemption option follow the same config conventions as the existing capacity-scheduler options (same queue-path prefix, etc.). 2) {{mockNested}} when(q.getQueuePath()) should consider the hierarchy of queues as well 3) It's better to add tests for a hierarchy of queues when preemption is disabled 4) In {{testPerQueueDisablePreemption}}, the number of preemptions after enabling queue-b's preemption is not very clear to me: {code} +// With no PREEMPTION_DISABLED set for queueB, get resources from both +// queueB and queueC (times() assertion is cumulative). +verify(mDisp, times(5)).handle(argThat(new IsPreemptionRequestFor(appB))); +verify(mDisp, times(16)).handle(argThat(new IsPreemptionRequestFor(appC))); {code} In the 2nd preemption, more resource is reclaimed from appC than appB; I think it should reclaim more resource from appB. Could you please take a look at what happened? I'm just afraid that because we changed the ideal resource calculation in the 1st preemption, it could affect the 2nd preemption. Thanks, Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
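The queue-path config convention suggested in item 1 above, including inheriting the flag from parent queues, can be sketched like this. The property name and lookup logic here are assumptions for illustration, not the final YARN-2056 implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: read a per-queue disable_preemption flag using
// the capacity-scheduler queue-path prefix convention, walking up the
// queue hierarchy (root.a.b -> root.a -> root) until a value is found,
// so child queues inherit the parent's setting by default.
class PreemptionConfig {
    private static final String PREFIX = "yarn.scheduler.capacity.";
    private final Map<String, Boolean> conf = new HashMap<>();

    void set(String queuePath, boolean disabled) {
        conf.put(PREFIX + queuePath + ".disable_preemption", disabled);
    }

    boolean isPreemptionDisabled(String queuePath) {
        String path = queuePath;
        while (true) {
            Boolean v = conf.get(PREFIX + path + ".disable_preemption");
            if (v != null) return v;
            int dot = path.lastIndexOf('.');
            if (dot < 0) return false; // default: preemption enabled
            path = path.substring(0, dot); // fall back to parent queue
        }
    }
}
```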
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113147#comment-14113147 ] Wangda Tan commented on YARN-1707: -- Hi [~curino], Thanks for updating. I think the current approach looks good to me, except for item 5. I just had a chat with Subru: as you mentioned, this change mainly makes ReservationQueues invisible to the user side. But I'm still concerned about changing the semantics, since this is a very important semantic of CSQueue. I hope to get more feedback about this before moving forward, and I'll think about it myself as well. Thanks, Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113187#comment-14113187 ] Wangda Tan commented on YARN-1707: -- Thanks for sharing this, Carlo! It's very helpful to have such an investigation result. Any thoughts, [~jianhe]? Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2447) RM web services app submission doesn't pass secrets correctly
[ https://issues.apache.org/jira/browse/YARN-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113515#comment-14113515 ] Wangda Tan commented on YARN-2447: -- Hi [~vvasudev], The fix is very straightforward to me: previously, secrets were not properly set because of a typo. Now they can be successfully get and set, and the modified test verifies this. LGTM, +1, Wangda RM web services app submission doesn't pass secrets correctly - Key: YARN-2447 URL: https://issues.apache.org/jira/browse/YARN-2447 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2447.0.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116184#comment-14116184 ] Wangda Tan commented on YARN-1707: -- Carlo, thanks for updating the patch. In addition to Jian's comment, the changes for displayQueueName look good to me. I don't have further comments about this patch for now. Thanks, Wangda Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.5.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116613#comment-14116613 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Thanks for updating your patch. bq. Do you mean that the prefix should be yarn.scheduler.capacity instead of yarn.resourcemanager.monitor.capacity.preemption? I have done this in this patch. Yes. bq. mockNested when(q.getQueuePath()) should consider hierarchy of queue as well The change makes sense to me. bq. testPerQueueDisablePreemption Now this is clearer. Looking forward to a test for disabling preemption on a hierarchy of queues. Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119515#comment-14119515 ] Wangda Tan commented on YARN-796: - Hi [~ViplavMadasu], Thanks a lot for reviewing the patch and pointing this out. That patch was a little outdated; I've already noticed and fixed this issue. I've attached a latest patch named YARN-796.node-label.consolidate.1.patch. I'm also working on splitting this big patch, and will update on this JIRA. Wangda Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.1.patch Attached latest consolidated patch named YARN-796.node-label.consolidate.1.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2492) [Umbrella] Allow for (admin) labels on nodes and resource-requests
Wangda Tan created YARN-2492: Summary: [Umbrella] Allow for (admin) labels on nodes and resource-requests Key: YARN-2492 URL: https://issues.apache.org/jira/browse/YARN-2492 Project: Hadoop YARN Issue Type: Task Components: api, client, resourcemanager Reporter: Wangda Tan Since YARN-796 is a sub JIRA of YARN-397, this JIRA is used to create and track sub tasks and attach split patches for YARN-796. Let's keep all over-all discussions on YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2492) [Umbrella] Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119658#comment-14119658 ] Wangda Tan commented on YARN-2492: -- Marked this JIRA as a clone of YARN-796. [Umbrella] Allow for (admin) labels on nodes and resource-requests --- Key: YARN-2492 URL: https://issues.apache.org/jira/browse/YARN-2492 Project: Hadoop YARN Issue Type: Task Components: api, client, resourcemanager Reporter: Wangda Tan Since YARN-796 is a sub JIRA of YARN-397, this JIRA is used to create and track sub tasks and attach split patches for YARN-796. Let's keep all over-all discussions on YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)