[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606934#comment-14606934 ]

chong chen commented on YARN-1042:
----------------------------------

A few comments and thoughts: 

1) YARN Application Model

The current YARN job model only has two layers of requests: 
application/application attempt and container request. Container requests are 
independent of each other during the scheduling phase. This model works fine 
for existing simple scenarios. However, when dealing with gang scheduling or 
affinity/anti-affinity, it is not enough. Essentially, all of these features 
introduce relationships among container requests. To express such 
relationships easily, it seems natural to have a group concept here: one 
application/attempt may include multiple groups of container requests, and 
each group includes multiple container requests. 

There can be policies within a group. For instance, for group1 of container 
requests, the application may prefer an anti-affinity policy over hosts, while 
for group2 an affinity policy within a rack would be preferable. Furthermore, 
with the group concept you can also express that, for group3, the scheduler 
must satisfy all container requests in the group before making the allocation 
decision, or, more advanced, that group4 needs at least 4 container requests 
satisfied at the same time before the allocation decision is made. There can 
also be policies among groups. For instance, if group1 represents HBase 
masters and group2 represents region servers, I may not want the two groups to 
share the same host. 

The bottom line is that each container request can no longer be considered 
independently; there are cases where the scheduler needs to consider them 
together to make the optimal placement decision. 
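
To make the group idea concrete, below is a minimal Java sketch of what such a 
grouped request API could look like. None of these types 
(ContainerRequestGroup, GroupConstraint, GroupPolicy) exist in YARN today; 
they are assumptions made purely for illustration.

import java.util.Arrays;
import java.util.List;

// Purely illustrative sketch -- none of these types exist in YARN today.
public class GroupedRequestSketch {

  enum Scope { HOST, RACK }
  enum GroupPolicy { AFFINITY, ANTI_AFFINITY, GANG_ALL, GANG_MIN }

  // A set of container requests the scheduler must reason about together.
  static class ContainerRequestGroup {
    final String id;
    final int numContainers;
    final GroupPolicy policy;     // intra-group placement policy
    final Scope scope;            // HOST or RACK
    final int minConcurrent;      // only meaningful for GANG_MIN

    ContainerRequestGroup(String id, int numContainers, GroupPolicy policy,
                          Scope scope, int minConcurrent) {
      this.id = id;
      this.numContainers = numContainers;
      this.policy = policy;
      this.scope = scope;
      this.minConcurrent = minConcurrent;
    }
  }

  // An inter-group rule, e.g. "group1 and group2 never share a host".
  static class GroupConstraint {
    final String groupA, groupB;
    final Scope scope;
    final boolean antiAffinity;

    GroupConstraint(String groupA, String groupB, Scope scope,
                    boolean antiAffinity) {
      this.groupA = groupA;
      this.groupB = groupB;
      this.scope = scope;
      this.antiAffinity = antiAffinity;
    }
  }

  public static void main(String[] args) {
    List<ContainerRequestGroup> groups = Arrays.asList(
        // group1: HBase masters spread over hosts.
        new ContainerRequestGroup("group1", 2, GroupPolicy.ANTI_AFFINITY, Scope.HOST, 0),
        // group2: region servers packed within a rack.
        new ContainerRequestGroup("group2", 8, GroupPolicy.AFFINITY, Scope.RACK, 0),
        // group3: gang -- allocate all containers together or none.
        new ContainerRequestGroup("group3", 6, GroupPolicy.GANG_ALL, Scope.HOST, 0),
        // group4: needs at least 4 containers satisfied at the same time.
        new ContainerRequestGroup("group4", 10, GroupPolicy.GANG_MIN, Scope.HOST, 4));
    // Policy among groups: masters and region servers never on the same host.
    GroupConstraint constraint =
        new GroupConstraint("group1", "group2", Scope.HOST, true);

    groups.forEach(g -> System.out.printf("%s: %d containers, %s at %s scope%n",
        g.id, g.numContainers, g.policy, g.scope));
    System.out.printf("%s vs %s: anti-affinity=%b at %s scope%n",
        constraint.groupA, constraint.groupB, constraint.antiAffinity,
        constraint.scope);
  }
}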

2) Container Request and Scheduling Flaw

To handle affinity and gang scheduling properly, the scheduler should own this 
logic and consider all container requests in a group as a whole. The logic 
belongs in the scheduler rather than the RM layer, and the scheduler needs to 
know the individual container requests within each group so it can make proper 
scheduling decisions. This leads to another potential design issue in current 
YARN. 

Currently, when the AM sends container requests to the RM and scheduler, it 
expands each individual container request into host/rack/any form. For 
instance, if I ask for a container with the preference "host1, host2, host3", 
and all three hosts are in the same rack rack1, then instead of sending one 
raw container request with the raw preference list to the RM/scheduler, the 
client expands it into five different objects: host1, host2, host3, rack1, and 
any. By the time the scheduler receives this information, the raw request has 
already been lost. This is acceptable for a single container request, but it 
causes trouble with multiple container requests from the same application. 
Consider this case:

6 hosts, two racks:

rack1 (host1, host2, host3) rack2 (host4, host5, host6)

When the application requests two containers with different data locality 
preferences:

c1: host1, host2, host4
c2: host2, host3, host5

The client ends up sending the following container request list to the 
RM/Scheduler:

host1: 1 instance
host2: 2 instances
host3: 1 instance
host4: 1 instance
host5: 1 instance
rack1: 2 instances
rack2: 2 instances
any: 2 instances
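
The following Java sketch (a simplified illustration of the aggregation the 
client-side expansion performs, not the actual AMRMClient code) reproduces the 
list above from the two raw requests c1 and c2. The host-to-rack mapping is 
hard-coded to match the example cluster.

import java.util.*;

// Simplified illustration of the host/rack/any expansion; not YARN code.
public class ExpansionIllustration {
  public static void main(String[] args) {
    Map<String, String> hostToRack = new HashMap<>();
    hostToRack.put("host1", "rack1"); hostToRack.put("host2", "rack1");
    hostToRack.put("host3", "rack1"); hostToRack.put("host4", "rack2");
    hostToRack.put("host5", "rack2"); hostToRack.put("host6", "rack2");

    // Raw requests as the application sees them.
    List<List<String>> rawRequests = Arrays.asList(
        Arrays.asList("host1", "host2", "host4"),   // c1
        Arrays.asList("host2", "host3", "host5"));  // c2

    // What the RM/scheduler receives: flat counts keyed by resource name,
    // with no link back to the originating request.
    Map<String, Integer> expanded = new TreeMap<>();
    for (List<String> request : rawRequests) {
      Set<String> racks = new HashSet<>();
      for (String host : request) {
        expanded.merge(host, 1, Integer::sum);
        racks.add(hostToRack.get(host));
      }
      for (String rack : racks) {
        expanded.merge(rack, 1, Integer::sum);
      }
      expanded.merge("any", 1, Integer::sum);
    }
    // Prints any: 2, host1: 1, host2: 2, host3: 1, host4: 1, host5: 1,
    // rack1: 2, rack2: 2 -- the same flattened list as above.
    expanded.forEach((name, count) ->
        System.out.println(name + ": " + count + " instance(s)"));
  }
}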

During scheduling, assume host1's heartbeat arrives first and the scheduler 
assigns a container on host1. Before the application master receives that 
container and updates its outstanding requests, if the next heartbeat comes 
from host4, the scheduler will assign again, even though both host entries 
came from the same container request (c1). 

Fundamentally, it is hard for the scheduler to make the right decision without 
knowing the raw container requests. The situation gets worse when dealing with 
affinity, anti-affinity, or gang scheduling. 

Ideally, the YARN resource allocation request should be changed to send the 
raw container requests instead of the expanded ones. It should be the 
scheduler module's responsibility to interpret raw requests and build whatever 
optimized data structures it needs. 
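
For illustration only, a raw request carried to the scheduler might look 
roughly like the record below. RawContainerRequest and its fields are 
assumptions, not existing YARN protocol types; the point is that the 
preference list and the identity of the originating request survive all the 
way to the scheduler.

import java.util.List;

// Hypothetical record -- not an existing YARN protocol type.
public final class RawContainerRequest {
  private final String requestId;            // e.g. "c1"; lets the scheduler
                                             // tie an allocation back to the
                                             // request it satisfies
  private final List<String> preferredHosts; // e.g. [host1, host2, host4]
  private final boolean relaxLocality;       // may fall back to rack / any

  public RawContainerRequest(String requestId, List<String> preferredHosts,
                             boolean relaxLocality) {
    this.requestId = requestId;
    this.preferredHosts = preferredHosts;
    this.relaxLocality = relaxLocality;
  }

  public String getRequestId() { return requestId; }
  public List<String> getPreferredHosts() { return preferredHosts; }
  public boolean isRelaxLocality() { return relaxLocality; }
}

With something like this, once the scheduler places c1 on host1 it can drop 
c1's remaining host entries (host2, host4) from consideration, so a later 
host4 heartbeat would not be double-counted against the same request.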



> add ability to specify affinity/anti-affinity in container requests
> -------------------------------------------------------------------
>
>                 Key: YARN-1042
>                 URL: https://issues.apache.org/jira/browse/YARN-1042
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
>            Assignee: Arun C Murthy
>         Attachments: YARN-1042-demo.patch, YARN-1042-design-doc.pdf, 
> YARN-1042.001.patch, YARN-1042.002.patch
>
>
> container requests to the AM should be able to request anti-affinity to 
> ensure that things like Region Servers don't come up on the same failure 
> zones. 
> Similarly, you may want to be able to specify affinity to the same host or rack 
> without specifying which specific host/rack. Example: bringing up a small 
> Giraph cluster in a large YARN cluster would benefit from having the 
> processes in the same rack purely for bandwidth reasons.


