[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571731#comment-13571731 ]

Robert Joseph Evans commented on YARN-371:
------------------------------------------

If you want to move this to yarn-dev@ that is fine with me.  My idea was not to 
support both protocols, but to support different in-memory layouts.  There 
would just be the single task-centric protocol, assuming that it is not too 
expensive, and the scheduler could then decide how it wants to store the 
requests.  If it throws away the task-centric nature and compresses the 
requests into a node-centric form, that is the scheduler's decision, but it 
does not force other schedulers to do the same thing.
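
To make that concrete, here is a rough sketch of one way a scheduler could take 
the single task-centric request and fold it down into node-centric counts.  The 
class and field names are made up for the sketch; they are not the actual YARN 
API.

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: one request per task, carrying every location the task
// would accept (nodes, racks, "*").
class TaskRequest {
  final List<String> locations;   // e.g. ["node1", "node2", "node3", "rack1", "*"]
  final int memoryMb;             // resource vector kept minimal for the sketch

  TaskRequest(List<String> locations, int memoryMb) {
    this.locations = locations;
    this.memoryMb = memoryMb;
  }
}

// One possible in-memory layout: throw away the per-task identity and keep
// only outstanding counts per location, which is roughly what the current
// resource-centric schedulers track today.
class NodeCentricStore {
  private final Map<String, Integer> counts = new HashMap<>();

  void add(TaskRequest req) {
    for (String loc : req.locations) {
      counts.merge(loc, 1, Integer::sum);   // compress: task identity is lost here
    }
  }

  int outstandingAt(String location) {
    return counts.getOrDefault(location, 0);
  }
}
{code}

A scheduler that cares about the per-task view could simply keep the 
TaskRequest objects instead; the wire format would not force either choice.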

That is why I wanted the table.  I want to know what the overhead is going to 
be compared to what it is now. I don't think it is going to be that bad, 
because we are already sending the responses back in a container-centric way, 
and so are all of the heartbeats from the node managers.  Granted, all of 
those are limited by the size of the cluster.

Back-of-the-envelope estimates don't seem that horrible; not great, but also 
not horrible.  Please correct my arithmetic if you see anything amiss, but 
100,000 tasks at 150 bytes each (the names of 3 nodes plus some extra stuff) 
would take about 15MB to transfer. Saturating a gigabit Ethernet connection 
would in theory take on the order of 6.5 million task requests/sec; in the 
real world, alongside other traffic, we probably would not want to go over 1 
million task requests/sec.  And because the protocol is a delta protocol (just 
an assumption based on how it currently works), that would be 1 million new 
task requests/sec, or about 10 very large jobs being launched every second. If 
each of those containers were on average about 1 GB, it would take a cluster 
with a churn of 1 PB/second to sustain 1 million requests/second, or about 
250 containers finishing every second on every node in a 4000 node cluster.
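
If it helps, here is the same arithmetic written out as a quick sketch, using 
only the assumptions above (150 bytes per request, roughly 1 GB per container, 
a 4000 node cluster):

{code}
public class EnvelopeMath {
  public static void main(String[] args) {
    long tasks = 100_000L;
    long bytesPerRequest = 150L;          // names of 3 nodes plus some extra stuff
    System.out.println("payload MB: "
        + tasks * bytesPerRequest / 1_000_000.0);                  // ~15 MB

    long requestsPerSec = 1_000_000L;     // the sustained rate discussed above
    long containerBytes = 1_000_000_000L; // ~1 GB average container
    System.out.println("cluster churn PB/s: "
        + requestsPerSec * (double) containerBytes / 1e15);        // ~1 PB/s

    int nodes = 4_000;
    System.out.println("containers finishing per node per sec: "
        + requestsPerSec / nodes);                                 // 250
  }
}
{code}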

If we move to 10GigE then there really is nothing to worry about from a 
networking perspective.

The next question after that is how difficult this transformation would be.  
Is it going to become a CPU bottleneck, or is the memory usage while the 
transformation is happening too high?  That is something that would really 
require some profiling to answer.  From what I can see it seems like a 
reasonable change to build a quick prototype of and see how it performs.
                
> Resource-centric compression in AM-RM protocol limits scheduling
> ----------------------------------------------------------------
>
>                 Key: YARN-371
>                 URL: https://issues.apache.org/jira/browse/YARN-371
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, resourcemanager, scheduler
>    Affects Versions: 2.0.2-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>
> Each AMRM heartbeat consists of a list of resource requests. Currently, each 
> resource request consists of a container count, a resource vector, and a 
> location, which may be a node, a rack, or "*". When an application wishes to 
> request that a task run in multiple locations, it must issue a request for each 
> location.  This means that for a node-local task, it must issue three 
> requests, one at the node-level, one at the rack-level, and one with * (any). 
> These requests are not linked with each other, so when a container is 
> allocated for one of them, the RM has no way of knowing which others to get 
> rid of. When a node-local container is allocated, this is handled by 
> decrementing the number of requests on that node's rack and in *. But when 
> the scheduler allocates a task with a node-local request on its rack, the 
> request on the node is left there.  This can cause delay-scheduling to try to 
> assign a container on a node that nobody cares about anymore.
> Additionally, unless I am missing something, the current model does not allow 
> requests for containers only on a specific node or specific rack. While this 
> is not a use case for MapReduce currently, it is conceivable that it might be 
> something useful to support in the future, for example to schedule 
> long-running services that persist state in a particular location, or for 
> applications that generally care less about latency than data-locality.
> Lastly, the ability to understand which requests are for the same task will 
> possibly allow future schedulers to make more intelligent scheduling 
> decisions, as well as permit a more exact understanding of request load.
> I would propose the tweak of allowing a single ResourceRequest to encapsulate 
> all the location information for a task.  So instead of just a single 
> location, a ResourceRequest would contain an array of locations, including 
> nodes that it would be happy with, racks that it would be happy with, and 
> possibly *.  Side effects of this change would be a reduction in the amount 
> of data that needs to be transferred in a heartbeat, as well as in the RM's 
> memory footprint, because what used to be different requests for the same 
> task are now able to share some common data.
> While this change breaks compatibility, if it is going to happen, it makes 
> sense to do it now, before YARN becomes beta.
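
To make the shapes being discussed concrete, here is a rough illustration of 
the current per-location requests next to the proposed single request carrying 
a list of locations.  The field names are illustrative only, not the actual 
ResourceRequest API.

{code}
import java.util.List;

// Illustrative only; not YARN's real ResourceRequest.
record CurrentRequest(String location, int numContainers, int memoryMb) {}
record ProposedRequest(List<String> locations, int numContainers, int memoryMb) {}

class Example {
  // Today: three unlinked requests for one node-local task, one per locality level.
  static List<CurrentRequest> today() {
    return List.of(
        new CurrentRequest("node1", 1, 1024),
        new CurrentRequest("rack1", 1, 1024),
        new CurrentRequest("*", 1, 1024));
  }

  // Proposed: a single request that carries every acceptable location, so the RM
  // knows all three entries describe the same task and can retire them together.
  static ProposedRequest proposed() {
    return new ProposedRequest(List.of("node1", "rack1", "*"), 1, 1024);
  }
}
{code}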
