[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297133#comment-15297133 ] Arun Suresh commented on YARN-2877: --- Ah.. makes sense.. thanks for the clarification [~djp] > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295735#comment-15295735 ] Akira AJISAKA commented on YARN-2877: - Hi [~asuresh], would you update the fix version field of the cherry-picked jiras? In addition, CHANGES.txt was added when cherry-picking YARN-2882. I filed YARN-5126 to remove it. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291039#comment-15291039 ] Junping Du commented on YARN-2877: -- bq. Thanks for investigation Wangda Tan and Junping Du The most investigation work is done by Wangda. We should put all credit to him. :) bq. Not sure if I understand correctly, are you proposing that we should NOT declare new fields in sequence ? for eg. if the last field index is 10 for a struct in trunk, if we want to add a new field, we should set it as something like 15 and not 11 ? I think what Wangda's propose above is: next time when we meet the same situation: patch 1 go to trunk first but not branch-2, patch 2 need to go to branch-2 and they all change field of the same proto (assume patch 1's field id = 2, patch 2's field id =3 on trunk). We don't necessary to adjust the sequence in trunk any more like we do it earlier. Instead, on branch-2, we can keep patch 2's filed Id =3 and skip id = 2 which is reserved for patch 1 to commit to branch-2 in future. That can save our lives from possible incompatible commits due to branch differences. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290580#comment-15290580 ] Arun Suresh commented on YARN-2877: --- [~jianhe], I just cherry-picked all sub-task patches from trunk to branch-2. Do let me know if you hit any issues. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290572#comment-15290572 ] Arun Suresh commented on YARN-2877: --- Thanks for investigation [~leftnoteasy] and [~djp] bq. So next time we should not update sequence of fields in trunk/branch-2, what we need to do is to make sure fields of protos across branches has same id. Not sure if I understand correctly, are you proposing that we should NOT declare new fields in sequence ? for eg. if the last field index is 10 for a struct in trunk, if we want to add a new field, we should set it as something like 15 and not 11 ? > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289535#comment-15289535 ] Wangda Tan commented on YARN-2877: -- An additional note: Junping and I investigated PB compatible cases when adding new fields to middle of a proto. Let's say: PB1Proto: {code} optional int32 w = 1; optional int32 x = 2; optional int32 z = 4; {code} PB2Proto: {code} optional int32 w = 1; optional int32 x = 2; optional int32 y = 3; optional int32 z = 4; {code} PB2Proto can read PB1Proto without any issue. So next time we should not update sequence of fields in trunk/branch-2, what we need to do is to make sure fields of protos across branches has same id. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288066#comment-15288066 ] Konstantinos Karanasos commented on YARN-2877: -- Marked the JIRA as resolved and added a release note. Thank you all for the invaluable feedback, the contributions and the extensive code reviews! Among all the people that contributed, I would like to particularly call out [~asuresh], [~curino], [~chris.douglas], [~subru], [~kishorch], [~kasha], [~leftnoteasy], [~jianhe], and [~sriramsrao]. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287526#comment-15287526 ] Arun Suresh commented on YARN-2877: --- bq. So, to unblock YARN-4832 for 2.8, we'll add the new field before the ContainerQueuingLimitProto instead of after, Sure.. sounds good > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287521#comment-15287521 ] Jian He commented on YARN-2877: --- [~asuresh], [~kkaranasos], sounds good to me. So, to unblock YARN-4832 for 2.8, we'll add the new field before the ContainerQueuingLimitProto instead of after, and you can do the backport later on. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287501#comment-15287501 ] Arun Suresh commented on YARN-2877: --- [~jianhe], Actually we would prefer it being in branch-2 too.. I'd say if we can commit YARN-5090 too.. it would be nice ( [~leftnoteasy] has already given a +1 ) Aside from that.. I had planned atleast 2 more JIRAs before we can release # Documentation patch # Minor flag in MapReduce to allow end users to test Distributed Scheduling (say allow a percentage of Map tasks to requested as OPPORTUNISTIC.. with default being 0) We can definitely add the above 2 after pushing to branch-2 > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287499#comment-15287499 ] Konstantinos Karanasos commented on YARN-2877: -- Hi [~jianhe] and thanks for bringing this up. We are actually thinking to push our changes to branch-2 as well. As you say, we will focus on pushing first the patches that could cause incompatibilities, such as the ones you mentioned. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287489#comment-15287489 ] Jian He commented on YARN-2877: --- [~asuresh], [~kkaranasos], how possible is this going to branch-2 too ? At least, I see YARN-5073(refactoring) can be committed in branch-2. Asking this because this will cause code divergence from trunk and branch-2. And other patches may require to write two versions. One other thing is the newly added protocol buffer field, I remember that protocol buffer requires field tagged number to be the same for compatibility. Since A new field ContainerQueuingLimitProto is added in NodeHeartBeatResponse and occupied number 14, but that is only committed in trunk. If we need to add a new field in NodeHeartBeatResponse in branch-2 (YARN-4832) which also uses field number 14. This will cause branch-2 and trunk incompatible in NM heartbeat. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041169#comment-15041169 ] Konstantinos Karanasos commented on YARN-2877: -- Hi [~wangda], Thanks for pointing out HADOOP-11552. It seems it can also be used for the same purpose. I would suggest to follow the technique of frequent AM-LocalRM heartbeats and less frequent LocalRM-RM heartbeats to start with. Once HADOOP-11552 gets resolved, we can consider using it. bq. I think top-k node list technique cannot completely solve the over subscribe issue, in a production cluster, application comes in waves, it is possible that few large applications can exhaust all resources in a cluster within few seconds. Maybe another possible approach to mitigate the issue is: propagating queue-able containers from NM to RM periodically, so NM can still make decision but RM can also be aware of these queue-able containers. As long as k is sufficiently big, the phenomenon you describe should not be very pronounced. Moreover, corrective mechanisms (YARN-2888) will lead to moving tasks from highly-loaded nodes to less busy ones. Going further, what you are suggesting would also make sense. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038760#comment-15038760 ] Wangda Tan commented on YARN-2877: -- Hi [~kkaranasos], Thanks for reply: bq. We are planning to address this by having smaller heartbeat intervals in the AM-LocalRM communication when compared to the LocalRM-RM. For instance, the AM-LocalRM heartbeat interval can be set to 50ms, while the LocalRM-RM interval to 200ms (in other words, we will only propagate to the RM only one in every four heartbeats). Maybe you could also take a look at HADOOP-11552, which could possibly achieve better latency and reduce heartbeat frequency. bq. This is a valid concern. The best way to minimize preemption is through the "top-k node list" technique described above. As the LocalRM will be placing the QUEUEABLE containers to the least loaded nodes, preemption will be minimized. I think top-k node list technique cannot completely solve the over subscribe issue, in a production cluster, application comes in waves, it is possible that few large applications can exhaust all resources in a cluster within few seconds. Maybe another possible approach to mitigate the issue is: propagating queue-able containers from NM to RM periodically, so NM can still make decision but RM can also be aware of these queue-able containers. bq. That said, as you also mention, QUEUEABLE containers are more suitable for short-running tasks, where the probability of a container being preempted is smaller. Ideally it's better to support all non-long-running-service tasks. LocalRM could allocate short-running queue-able tasks and RM an allocate other queue-able tasks. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037335#comment-15037335 ] Konstantinos Karanasos commented on YARN-2877: -- Thank you for the detailed comments, [~leftnoteasy]. Regarding #1: - Indeed the AM-LocalRM communication should be much more frequent than the LocalRM-RM (and subsequently AM-RM) communication, in order to achieve mili-second latency allocations. We are planning to address this by having smaller heartbeat intervals in the AM-LocalRM communication when compared to the LocalRM-RM. For instance, the AM-LocalRM heartbeat interval can be set to 50ms, while the LocalRM-RM interval to 200ms (in other words, we will only propagate to the RM only one in every four heartbeats). We will soon create a sub-JIRA for this. - Each NM will periodically estimate its expected queue wait time (YARN-2886). This can simply be based on the number of tasks currently in its queue, or (even better) based on the estimated execution times of those tasks (in case they are available). Then, this expected queue wait time is pushed through the NM-RM heartbeats to the ClusterMonitor (YARN-4412) that is running as a service in the RM. The ClusterMonitor gathers this information from all nodes, periodically computes the least loaded nodes (i.e., with the smallest queue wait times), and adds that list to the heartbeat response, so that all nodes (and in turn LocalRMs) get the list. This list is then used during scheduling in the LocalRM. Note that simpler solutions (such as the power of two choices used in Sparrow) could be employed, but our experiments have shown that the above "top-k node list" leads to considerably better placement (and thus load balancing), especially when task durations are heterogeneous. Regarding #2: This is a valid concern. The best way to minimize preemption is through the "top-k node list" technique described above. As the LocalRM will be placing the QUEUEABLE containers to the least loaded nodes, preemption will be minimized. More techniques can be used to further mitigate the problem. For instance, we can "promote" a QUEUEABLE container to a GUARANTEED one in case it has been preempted more than k times. Moreover, we can dynamically set limits to the number of QUEUEABLE containers accepted by a node in case of excessive load due to GUARANTEED containers. That said, as you also mention, QUEUEABLE containers are more suitable for short-running tasks, where the probability of a container being preempted is smaller. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036916#comment-15036916 ] Wangda Tan commented on YARN-2877: -- Thanks [~kkaranasos], [~asuresh], I just caught up with latest design doc, my 2 cents: There're two major purpose of distributed RM, 1) get better allocation latency 2) leverage idle resources. #1 will be achieved when - AM -> LocalRM communication can be done within a single RPC call. (Doesn't do heartbeat like normal AM-RM allocation), otherwise it will be hard to achieve milli-seconds level latency. - LocalRM has enough information to allocate resource on a NM which could be directly used without waiting. I think stochastic + caching some information of other LocalRM could solve the problem. #2 can be achieved, but since the distributed RM solution doesn't have a global picture of resources and guaranteed containers can always preempt queueable containers. This could lead to excessive queueable containers preempted. If we can decide where to allocate queueable container from RM, RM could avoid a lots of such preemptions. (Instead of allocating on a node has lots of queueable containers, allocate on node with "real" idle resources). To me, this becomes a bigger issue if application wants to use opportunistic resources to run normal containers (such as a 10 min MR task). How to guarantee RM doesn't allocate more resources for a long time to LocalRM is a problem. IMO distributed RM is more suitable for short-lifed (few seconds) and low latency tasks. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks otherwise > idle resources on individual machines. > 2. Reduce allocation latency. Tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241570#comment-14241570 ] Anubhav Dhoot commented on YARN-2877: - +1 for notion of distributed scheduling. I think it will go a long way for addressing both latency and scale goals for YARN. In my experience with using similar distributed scheduling systems we can run into following types of issues a) the node is currently full of running containers and the estimate of when capacity will free up for running queued requests could be hard/wrong. Your request might be queued a long time affecting latency of the queue-able container startup b) multiple LocalRMs could race to grab available space on a NM and one might get queued behind other requests having similar effects as a). For sake of discussion of mechanisms, I would suggest discussion of pros and cons for ability to 1) schedule queueable containers on multiple nodes, 2) ability to cancel queued requests Giving the power of at least 2 NM choices could address a lot of variability of queue-able container startup latency. One way is keep the queue of requests in the NM, but if needed, NMs ultimately confirm with the requesting LocalRM to ensure that the queued request is still valid. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241618#comment-14241618 ] Konstantinos Karanasos commented on YARN-2877: -- Thanks for the input, [~adhoot]. This is an interesting discussion. There are indeed cases that distributed scheduling can hurt job latency. This is more pronounced in the following cases: # Queueable containers are used both for short- and long-running tasks. # For Jobs that have many tasks (chances that one of these tasks will get stuck in a queue are higher). # Cluster load is higher. Based on the above situations, a first observation is that queueable containers should be mostly used for short-running tasks, if job latency is of importance. Moreover, when jobs have a big number of tasks, probably the AM policy should ask for optimistic containers only for a subset of them (even if they all are short-running). Still though, as you also mention, corrective mechanisms should be used to further improve latency. - One such mechanism is *queuing in multiple locations* as is done by Sparrow and Apollo. In that case the LocalRM should pick two nodes instead of one to queue the request. This is something we have not tried yet, but it may be useful to do so. - Another mechanism we are proposing is *queue rebalancing*, that is, whenever some queues have much bigger load than others, we dequeue some of its requests and send them to a less loaded queue. Of course, we need to take care when to dequeue containers, because we may end up increasing the latency if we accidentally dequeue the same request many times. - A last mechanism that seems interesting is *reordering of requests* within a queue, based on some policy (e.g., based on the submission time of the application the task belongs to). More thoughts are definitely welcome. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222823#comment-14222823 ] Devaraj K commented on YARN-2877: - +1 for the idea [~sriramsrao], [~curino] . I just wanted to know these if I am not missing something from the above. 1. If the OPTIMISTIC Container is assigned to AM, and also at the same time RM assigned a container i.e. CONSERVATIVE for the same resource, which one NM will consider and start it? 2. If the OPTIMISTIC Container is assigned to AM and started it, and NM receives a container start request for CONSERVATIVE and resources are not available, will the NM preempt the running OPTIMISTIC Containers or it will make CONSERVATIVE request to wait for completing the OPTIMISTIC Containers? 3. Any provision for AM to request OPTIMISTIC containers in the remote NM also? Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222872#comment-14222872 ] Konstantinos Karanasos commented on YARN-2877: -- [~wangda], regarding your question about how the AM will know which NM is more idle than others, this is related with YARN-2886. Each NM estimates its waiting queue time (based on the tasks running and those waiting in the queue already) and sends this waiting time to the RM through the heartbeat. Note that this is just an integer, so it is very lightweight. Then the RM can push this information to the rest of the NMs (again through the heartbeats). This way each node knows the queue status of the other NMs and can decide where to queue its queueable requests. However, since this information may be always precise (due to bad estimation or stale info), we also introduce correction mechanisms for rebalancing the queues, if need be (YARN-2888). Regarding your other questions: # These malicious AMs is one of the basic reasons we have introduced the Local RM. The AMs can make queueable requests only to the Local RM, who can throttle down aggressive AMs without even needing to reach the central RM. Clearly, as you mention, the central RM can also be involved for imposing elaborate fairness/capacity constraints, if those are needed. # Promoting a queueable container to a guaranteed-start one is indeed interesting, and we have been investigating the cases for which it would bring benefits. One is the case you mention. Another is in case a queueable container has been pre-empted/killed many times due to other guaranteed-start requests. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222876#comment-14222876 ] Konstantinos Karanasos commented on YARN-2877: -- [~devaraj], to answer your questions: # Guaranteed-start containers always have priority over queueable ones. Thus, in the case you describe, if not both requests can be accommodated by the NM, the guaranteed-start will start first. # If the queueable one was started before the guaranteed-start arrived, it will be pre-empted/killed for the guaranteed-start to begin execution. # Queueable requests are submitted by the AM in the Local RM running in the same node as the AM, but those requests can be queued at any NM of the cluster (we pick at each moment the most idle ones to queue those requests). Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222880#comment-14222880 ] Konstantinos Karanasos commented on YARN-2877: -- I used the wrong name in the above comment -- it was referring to [~devaraj.k]'s comment. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223197#comment-14223197 ] Carlo Curino commented on YARN-2877: I am going to echo [~kkaranasos] regarding malicious AMs. The key architectural change we propose is to introduce a proxy layer (YARN-2884). This is giving us a place that is both distributed, but part of the infrastructure (thus inherently trusted) where to enact policies. This is where we host the LocalRM functionality of YARN-2885. With this in place we do not have to depend on the trusting the AM regarding distributed decisions (the AM only exposes need for containers of different type). On the contrary, we can enable a broad spectrum of infrastructure-level policies, that can leverage explicit or implicit information to impose caps, or to balance (or skew) where the queuable containers should be allocated etc. As we have done in the past, we are working towards providing rather *general purpose mechanisms*, and propose a *first set of policies* (AM, LocalRM, NM start/stop of containers). Policies can be evolved/overridden easily depending on use-cases, while mechanisms are a little harder to change. To this end, discussing carefully other use cases, such as the conversation around using queuable containers for Impala, is very important, as we might have missed hooks as part of the mechanisms, that are necessary to support those scenarios. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222447#comment-14222447 ] Wangda Tan commented on YARN-2877: -- Thanks very much for explanation from [~kkaranasos], [~sriramsrao], I reply together, Now I can better understand the use case. Yes, the queueable containers not necessarily need to send to central RM. Except we want to add other features like queue balancing, etc. One more question, how AM can know which NM is more idle than others? Since simply querying NM status from every NM is not efficient enough. And I'm thinking the distributed scheduling could be integrated to existing scheduler, like Capacity Scheduler. Some other features could be added with this, # For now, we trust AM will make correct opportunistically container launch request. But considering a case like a large cluster has only few applications use opportunistically launch, others are conservative apps. It is possible an AM can steal a lot of resource from NMs from cluster by sending opportunistical launch request to all NMs. We can have a centralized RM combine resource usage of queueable/conservative containers to enforce fairness. And put malicious AMs to blacklists. # As the name of the title (distributed scheduling), we may be able to do more than opportunistically. For example, we can launch a opportunistically container in NM, but it is possible to become a consertive container after heartbeat to RM if the resource meets capacity settings for each queue. Thanks in advance! Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221990#comment-14221990 ] Konstantinos Karanasos commented on YARN-2877: -- [~Sujeet Varakhedi], also the Apollo paper (OSDI 2014) has interesting ideas about distributed scheduling. [~wangda], glad you like the idea and thanks for the interesting points. To answer your questions: 1. Apart from the limit that the LocalRM can impose in the number of queueable containers that each AM can receive (for which the central RM does not need to be involved), in the heartbeat response from the RM to the NM, information about the status of the other queues of the system will be passed as well. This way we will be able to impose global policies (such as capacity) in a distributed fashion. BTW this information is also used by the LocalRMs to decide in which NMs to queue requests. 2. If no policies need to be imposed, the central RM does not need to know anything about the queueable containers that each AM uses. Limits in the number of queueable containers per AM can be imposed directly by the LocalRM. However, in case fine-grained policies need to be imposed (as mentioned in point (1) above, such as number of queueable containers per queue in the capacity scheduler), the central RM can receive information about the number of queueable containers used by each AM, so that it imposes limits per queue. Clearly, the more information you pass to the central RM, the more powerful policies you can impose, but also the bigger the load you push to the central RM. So, there is a sweet-spot there based on the needs of each cluster. 3. This is a good point as well. Such information can be piggybacked in the heartbeats to the central RM (again, with the tradeoffs discussed above). Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221991#comment-14221991 ] Sriram Rao commented on YARN-2877: -- [~leftnoteasy] By definition, the allocation decisions made by the central RM win out. That is, whenever there is a conflict, *guaranteed-start* (or CONSERVATIVE) containers will be executed prior to *queueable* (or OPTIMISTIC) containers. This could also means that the NM may be forced to preempt running *queueable* containers to make room. Lastly, to allow some level of predictability in terms of execution time for *queueable* containers, we could use leases---a *queueable* container is allowed to execute for at most N secs even when there is conflict and if the container hasn't exited, the NM will preempt them after that time interval elapses (i.e., lease expires). This mechanism can allow minimizing preemption for *queueable* containers. Re: your other questions: # Capacity is enforced for *guaranteed-start* containers. For *queueable* containers, policies could be pushed down from central-RM ([YARN-2885|https://issues.apache.jira/browse/YARN-2885]) # It is not necessary that the *queueable* containers factor into central RM's allocation choices. That said, having that information at the central-RM can help minimize preemption. # For enabling load balancing of queues at the NM's (([YARN-2888|https://issues.apache.jira/browse/YARN-2888]), allow AM's to make choices of where to submit *queueable* containers ([YARN-2887|https://issues.apache.jira/browse/YARN-2887]), exposing queue information to local-RM's is desirable. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221012#comment-14221012 ] Konstantinos Karanasos commented on YARN-2877: -- Adding some more details, now that we have added the first sub-tasks. In YARN-2882 we introduce two *types of containers*: guaranteed-start and queueable. The former are the ones existing in YARN today (are allocated from the central RM, and once allocated, are guaranteed to start). The latter make it possible to queue container requests in the NMs and will be used for distributed scheduling. The *queuing of (queueable) container requests* in the NMs is proposed in YARN-2883. Each NM will now also have a *LocalRM* (Local ResourceManager) that will receive all container requests from the AMs running on the same machine: - For the guaranteed-start container requests, the LocalRM acts as a proxy (YARN-2884), forwarding them to the central RM. - For the queueable container requests, the LocalRM is responsible for sending them directly to the NM queues (bypassing the central RM). Deciding the NMs where these requests are queued is based on the estimated waiting time in the NM queues, as discussed in YARN-2886. Based on some policy (YARN-2887), each AM will determine *what type of containers to ask*: only guaranteed-start, only queueable, or a mix thereof. For instance, an AM may request guaranteed-start containers for its tasks that are expected to be long-running, whereas it may ask for queueable containers for its short tasks (in which the back-and-forth with the central RM may be longer than the task execution time). This way we reduce the scheduling latency, while increasing the utilization of the cluster (if we had to go to the central RM for all these short tasks, some resources of the cluster might remain idle in the meanwhile). To ensure the NM queues remain balanced, we propose *corrective mechanisms for NM queue rebalancing* in YARN-2888. Moreover, to ensure no AM is abusing the system by asking too many queueable containers, we can impose a limit in the *number of queueable containers* that each AM can receive (YARN-2889). Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221174#comment-14221174 ] Chen He commented on YARN-2877: --- This is a interesting idea. Distributed scheduling and global scheduling have their own pros and cons. For short, global scheduling can achieve optimal matching between tasks and resources but may have scalability problem when system becomes larger and larger. Distributed scheduling is scalable but may reach sub-optimal if there is no communication between those distributed schedulers. The LocalRM can reduce the RM's burden by doing communications to local AMs. It is a good idea. IMHO, the worker nodes become increasingly powerful and large (more mems and cores). Is that possible that the LocalRM affects NM's performance if there are many AMs running on a single server? Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221203#comment-14221203 ] Sriram Rao commented on YARN-2877: -- [~airbots] The number of AM's running on any machine is configurable/small---on the order of a few tens, and so the overhead on LocalRM should be negligible. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221309#comment-14221309 ] Sujeet Varakhedi commented on YARN-2877: + 1 for distributed scheduling and SQL engines for Hadoop can greatly benefit from it. We also need to look at a design we can give AMs more control over scheduling policies where RM just acts a source of overall cluster state, NM's have local queues and then based on NM queue wait times AM's can decide where to requests tasks. Similar to how Sparrow works. This kind of scheduling becomes important for services that need dedicated non-shared clusters like HBASE and HAWQ. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221745#comment-14221745 ] Wangda Tan commented on YARN-2877: -- Thanks [~sriramsrao] for bringing up the great idea and [~kkaranasos]/[~curino]'s explanations. Definitely we need such mechanisms to have low-latency container launching to support millisec-level-latency tasks. Some questions about this, # Since the LocalRMs will be totally distributed, does it still possible to enforce capacity between queues? # Will such opportunistical containers come to view of the central RM (used to schedule CONSERVATIVE containers)? ## If yes, will the central RM can decide if a opportunistical container is valid or not (saying #containers excesses the app's limitation)? And will the preemption still works for opportunistical containers ## If no, should we have someone to coordinate such containers? # Will central scheduler state (maybe not completely, but important info like queue used resource, etc.) broadcast to distributed LocalRMs? I think it might be usaful for LocalRMs to decide which opportunistical container should go first. Thanks in advance! Wangda Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219844#comment-14219844 ] Karthik Kambatla commented on YARN-2877: +1 to the idea, particularly to reduce the allocation latency. I definitely see Impala wanting to use this in the future. Not mentioned in the description, I believe scale is probably another big reason for distributed scheduling. bq. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. A centralized RM could schedule tasks opportunistically too? Is the intention to quickly adapt to changing resource usage on the node, and the latency due to NM-RM-NM communication being too long to loose this window of opportunity? Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219871#comment-14219871 ] Carlo Curino commented on YARN-2877: Karthik, you are correct... Karthik, glad you like the idea, and you ask good questions... This could be relevant to lower the load on the central RM (hence help with scale), in particular if we have a vast number of short-lived tasks (heavy scheduling cost for little work). (However, we have other ongoing work towards that, which we will post soon, hence the focus on utilization) What takes care of the fast adaption to node conditions is having a local queue (from which to pick more work if I am idle), and the notion of different containers types (i.e., I can kick out the optimistic containers if I am overbooked). With this in mind, the RM could be the one making scheduling decisions for queueable/optimistic containers as well, as you pointed out. What is constant (whether you make the scheduling decisions centrally or distributed), is the notion of different container types (see YARN-2882). This should be exposed to the AM, as it comes with very different level of guarantees on the container start/completion. Thus the AM need to know which type of containers to use for different tasks (e.g., short lived or non-critical-path containers can be optimistic). Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219874#comment-14219874 ] Sriram Rao commented on YARN-2877: -- [~kasha] (1) Yes, the central RM can allocate optimistic containers, however, as you note it introduces extra latency. (2) Scaling the RM's allocation particularly when you have small tasks is another motivation as well. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217667#comment-14217667 ] Steve Loughran commented on YARN-2877: -- Linking to HADOOP-11317 to cover project-wide use. I don't think yarn-common needs to explicitly declare a dependency on log4j, at least outside the test run. If you comment out that dependency —does everything still build? Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217668#comment-14217668 ] Steve Loughran commented on YARN-2877: -- (ignore that comment, was for YARN-2875) Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217207#comment-14217207 ] Sriram Rao commented on YARN-2877: -- The proposal: # Extend the NM to support task queueing. AM's can queue tasks directly at the NM's and the NM's will execute those tasks opportunistically. # Extend the type of containers that YARN exposes: #* CONSERVATIVE: This corresponds to containers allocated by YARN today. #* OPTIMISTIC: This corresponds to a new class of containers, which will be queued for execution at the NM. This extension allows AM's to control what type of container they are requesting from the RM framework. # Extend the NM with a local RM (i.e., a local Resource Manager) which uses local policies for deciding when an OPTIMISTIC container can be executed. We are exploring using timed leases for OPTIMISTIC containers to ensure minimum duration of execution. On the other hand, this mechanism allows NM's to free up resources and thus guarantee predictable start times for CONSERVATIVE containers. There are additional motivations for the uses of this feature and we will discuss them in follow-up comments. Extend YARN to support distributed scheduling - Key: YARN-2877 URL: https://issues.apache.org/jira/browse/YARN-2877 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Sriram Rao This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following: 1. Improve cluster utilization by opportunistically executing tasks otherwise idle resources on individual machines. 2. Reduce allocation latency. Tasks where the scheduling time dominates (i.e., task execution time is much less compared to the time required for obtaining a container from the RM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)