[ https://issues.apache.org/jira/browse/YARN-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620709#comment-15620709 ]

Wangda Tan commented on YARN-5139:
----------------------------------

Thanks [~asuresh] for the comments, all great points. Let me try to explain:

1) Regarding uniform distribution of applications:

First I want to make sure I understand your question correctly; there are two 
different kinds of allocations:
1. Look at one node at a time for each resource request
2. Look at multiple nodes at a time for each resource request

For #1, it is the same as today's async scheduling or scheduling with node 
heartbeat, so we get uniform distribution of allocations for free. I assume 
your question is about #2.
For #2, we need to take the node's utilization into consideration.
If you look at the implementation notes (uploaded to 
https://github.com/leftnoteasy/public-hadoop-repo/blob/global-scheduling-3/global-scheduling-explaination.md),
there's a pseudo-code snippet showing how nodes are matched to a resource request 
inside an application:
{code}
    // Filter clusterPlacementSet by the given resource request, for example:
    // - Hard locality
    // - Anti-affinity / Affinity
    PlacementSet filteredPlacementSet = filter(clusterPlacementSet, request);

    // Sort filteredPlacementSet according to the resource request
    for (node in sort(filteredPlacementSet, request)) {
       if (node.has_enough_available_resource()) {
          // Node has enough available resource to satisfy this request:
          // return a proposal to allocate the container on this node
       } else {
          // Node doesn't have enough available resource:
          // return a proposal to reserve the container on this node
       }

       // Other outcomes are also possible:
       // - A container is released, for example an unnecessary reserved container
       // - No node can be found, return NOTHING_ALLOCATED
    }
{code}

The sort(filteredPlacementSet, request) step can feed the nodes in an order 
that takes utilization into account.
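
To make that concrete, here is a minimal, self-contained sketch of what a 
utilization-aware ordering could look like. This is not code from the patch; 
the CandidateNode class, its fields, and the tie-breaking rule are assumptions 
for illustration only:

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical candidate node with an aggregated utilization score.
class CandidateNode {
    final String host;
    final double utilization;      // fraction of allocated resource, 0.0 - 1.0 (assumed metric)
    final long availableMemoryMb;  // remaining allocatable memory on the node

    CandidateNode(String host, double utilization, long availableMemoryMb) {
        this.host = host;
        this.utilization = utilization;
        this.availableMemoryMb = availableMemoryMb;
    }
}

public class UtilizationAwareSort {
    // Try the least-utilized nodes first; break ties by most free memory.
    static final Comparator<CandidateNode> LEAST_UTILIZED_FIRST =
        Comparator.comparingDouble((CandidateNode n) -> n.utilization)
                  .thenComparing(Comparator.comparingLong(
                      (CandidateNode n) -> n.availableMemoryMb).reversed());

    public static void main(String[] args) {
        List<CandidateNode> filteredPlacementSet = new ArrayList<>(Arrays.asList(
            new CandidateNode("nm-1", 0.80, 4096),
            new CandidateNode("nm-2", 0.35, 16384),
            new CandidateNode("nm-3", 0.35, 8192)));

        filteredPlacementSet.sort(LEAST_UTILIZED_FIRST);

        // The allocation loop above would then walk nodes in this order:
        // nm-2, nm-3, nm-1
        filteredPlacementSet.forEach(n ->
            System.out.println(n.host + " utilization=" + n.utilization));
    }
}
{code}

The actual ordering policy in the patch may differ; the sketch only shows the 
shape of the comparison that sort(filteredPlacementSet, request) would apply.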

2)
bq. Another thing that came to mind is that, given that you are kind of 
'late-binding' the request to a group of nodes ...

Great point, which we somehow missed in today's async scheduling implementation 
as well. One (relatively) simple thing we can do is to maintain an internal 
"scheduling state" for NodeManagers: if an NM hasn't heartbeated for X seconds 
(e.g. X = 10 * NM-heartbeat-interval), we stop allocating new containers to 
such NMs. We can also recall containers that are allocated but not yet acquired 
on such NMs.
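
As a rough illustration of that idea only (the class and method names below are 
assumptions, not existing YARN APIs), the per-NM "scheduling state" could be as 
simple as a last-heartbeat timestamp that the scheduler consults before 
proposing an allocation on a node:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track the last heartbeat time per NM and skip nodes
// that have been silent longer than a configured threshold.
public class NodeSchedulingState {
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
    private final long staleThresholdMillis;

    public NodeSchedulingState(long nmHeartbeatIntervalMillis) {
        // e.g. X = 10 * NM heartbeat interval, as suggested above
        this.staleThresholdMillis = 10 * nmHeartbeatIntervalMillis;
    }

    // Called whenever an NM heartbeat arrives.
    public void recordHeartbeat(String nodeId) {
        lastHeartbeatMillis.put(nodeId, System.currentTimeMillis());
    }

    // Consulted before proposing a new allocation on the node; a stale node
    // would also be a candidate for recalling not-yet-acquired containers.
    public boolean isSchedulable(String nodeId) {
        Long last = lastHeartbeatMillis.get(nodeId);
        if (last == null) {
            return false; // never heard from this node
        }
        return System.currentTimeMillis() - last <= staleThresholdMillis;
    }
}
{code}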

One question from my side: IIRC, FS enables async scheduling by default (CS 
supports async scheduling too, but it has lots of issues, so I haven't seen 
anybody enable it in production). So I'm curious, in your estimation, on 
average how many nodes could fail or be lost per hour in a 10K-node cluster? If 
it happens very often, have any users complained about this issue with today's 
async scheduling?

> [Umbrella] Move YARN scheduler towards global scheduler
> -------------------------------------------------------
>
>                 Key: YARN-5139
>                 URL: https://issues.apache.org/jira/browse/YARN-5139
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: Explanantions of Global Scheduling (YARN-5139) 
> Implementation.pdf, YARN-5139-Concurrent-scheduling-performance-report.pdf, 
> YARN-5139-Global-Schedulingd-esign-and-implementation-notes-v2.pdf, 
> YARN-5139-Global-Schedulingd-esign-and-implementation-notes.pdf, 
> YARN-5139.000.patch, wip-1.YARN-5139.patch, wip-2.YARN-5139.patch, 
> wip-3.YARN-5139.patch, wip-4.YARN-5139.patch, wip-5.YARN-5139.patch
>
>
> Existing YARN scheduler is based on node heartbeat. This can lead to 
> sub-optimal decisions because the scheduler can only look at one node at a 
> time when scheduling resources.
> Pseudo code of existing scheduling logic looks like:
> {code}
> for node in allNodes:
>    Go to parentQueue
>       Go to leafQueue
>         for application in leafQueue.applications:
>            for resource-request in application.resource-requests
>               try to schedule on node
> {code}
> Considering future complex resource placement requirements, such as node 
> constraints (give me "a && b || c") or anti-affinity (do not allocate HBase 
> regionservers and Storm workers on the same host), we may need to consider 
> moving the YARN scheduler towards global scheduling.


