[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571848#comment-13571848 ]

Robert Joseph Evans commented on YARN-371:
------------------------------------------

I agree that 150 bytes per task is large, but I was doing a back-of-the-envelope 
estimate for a worst-case approximation.  Like you said, there are ways around 
this using dictionaries and the like.  That is why it would be nice to see a 
working prototype, so we know what the size would be in reality.
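
For concreteness, here is the rough arithmetic as a small Java sketch.  The 
150 bytes is the worst-case figure from above; the dictionary constants are 
made up purely for illustration:

    public class HeartbeatEstimate {
      public static void main(String[] args) {
        int tasks = 100_000;          // peak job size, see below
        int bytesPerTask = 150;       // worst-case per-task request size
        long naive = (long) tasks * bytesPerTask;            // ~15 MB
        // With a host-name dictionary, each request might carry a small
        // integer index instead of repeated host strings: assume ~20 bytes
        // per task plus a one-time table of a few thousand distinct hosts.
        int distinctHosts = 4_000;    // made-up cluster size
        long dictCoded = (long) tasks * 20 + (long) distinctHosts * 30;  // ~2 MB
        System.out.printf("naive: %.1f MB, dictionary: %.1f MB%n",
            naive / 1e6, dictCoded / 1e6);
      }
    }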

To answer your question about large jobs that take up the entire cluster: this 
is very much a reality we have to deal with.  In 1.0 the JT would limit the 
maximum number of tasks that a single job could run.  We run with this set to 
100,000.  In 0.23 that limit is removed, and it is entirely up to the amount of 
memory that your application master has available to it.  I did some work to 
reduce memory consumption in the MR AM to try to ensure that an AM with the 
default heap size of about 1GB could still handle 100,000 tasks reasonably 
well.  But I have seen jobs with well over 100,000 map tasks.  The average size 
of jobs we see is actually much smaller, but it is not the average you have to 
worry about with memory; it is the peak.  Unless you have explicit limits on 
job sizes, you have the potential for a DoS attack, whether intentional or 
otherwise.
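
To put numbers on that: with the default AM heap, the per-task budget is tiny, 
which is why the peak job is what matters.  A minimal sketch (the constants 
mirror the figures above; nothing here is actual MR AM code):

    public class AmHeapBudget {
      public static void main(String[] args) {
        long heapBytes = 1L << 30;     // ~1 GB default MR AM heap
        int peakTasks = 100_000;       // the limit we ran the 1.0 JT with
        long perTaskBudget = heapBytes / peakTasks;   // ~10 KB per task
        System.out.println("AM heap budget per task: "
            + perTaskBudget + " bytes");
        // An explicit cap, like the old JT limit, bounds this no matter
        // what users submit; without one, a single huge job can OOM the AM.
      }
    }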

If the protocol does change, you would also need the AM to be able to limit 
the maximum number of requests that would be allowed.  I am not really sure 
how simple that is to do with Protocol Buffers.
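
As far as I know, stock Protocol Buffers only lets you cap the total message 
size (CodedInputStream.setSizeLimit), not the length of a repeated field, so 
the count check would probably have to happen after parsing.  A hypothetical 
sketch of such a guard (names are made up; this is not actual YARN code):

    import java.util.List;

    public class AllocateGuard {
      private final int maxRequests;

      public AllocateGuard(int maxRequests) {
        this.maxRequests = maxRequests;
      }

      /** Reject heartbeats whose ask list exceeds the configured cap. */
      public <T> void checkAsk(List<T> ask) {
        if (ask.size() > maxRequests) {
          throw new IllegalArgumentException("Heartbeat carried "
              + ask.size() + " resource requests, above the limit of "
              + maxRequests);
        }
      }
    }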
                
> Resource-centric compression in AM-RM protocol limits scheduling
> ----------------------------------------------------------------
>
>                 Key: YARN-371
>                 URL: https://issues.apache.org/jira/browse/YARN-371
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, resourcemanager, scheduler
>    Affects Versions: 2.0.2-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>
> Each AMRM heartbeat consists of a list of resource requests. Currently, each 
> resource request consists of a container count, a resource vector, and a 
> location, which may be a node, a rack, or "*". When an application wishes a 
> task to run in any of multiple locations, it must issue a request for each 
> location.  This means that for a node-local task, it must issue three 
> requests: one at the node level, one at the rack level, and one with * (any). 
> These requests are not linked with each other, so when a container is 
> allocated for one of them, the RM has no way of knowing which others to get 
> rid of. When a node-local container is allocated, this is handled by 
> decrementing the number of requests on that node's rack and in *. But when 
> the scheduler satisfies a node-local request at the rack level instead, the 
> request on the node is left there.  This can cause delay scheduling to try to 
> assign a container on a node that nobody cares about anymore.
>
> Additionally, unless I am missing something, the current model does not allow 
> requests for containers only on a specific node or specific rack. While this 
> is not a use case for MapReduce currently, it is conceivable that it might be 
> something useful to support in the future, for example to schedule 
> long-running services that persist state in a particular location, or for 
> applications that generally care less about latency than data locality.
>
> Lastly, the ability to understand which requests are for the same task will 
> possibly allow future schedulers to make more intelligent scheduling 
> decisions, as well as permit a more exact understanding of request load.
>
> I would propose the tweak of allowing a single ResourceRequest to encapsulate 
> all the location information for a task.  So instead of just a single 
> location, a ResourceRequest would contain an array of locations, including 
> nodes that it would be happy with, racks that it would be happy with, and 
> possibly *.  Side effects of this change would be a reduction in the amount 
> of data that needs to be transferred in a heartbeat, as well as in the RM's 
> memory footprint, because what used to be different requests for the same 
> task would now be able to share some common data.
>
> While this change breaks compatibility, if it is going to happen, it makes 
> sense to do it now, before YARN becomes beta.
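
To make the quoted proposal concrete, here is one possible shape for the new 
request, as a Java sketch (invented names; the real ResourceRequest API would 
differ):

    import java.util.List;

    // Today a node-local task expands into three unlinked requests:
    //   (capability, "node1", n), (capability, "rack1", n), (capability, "*", n)
    // Under the proposal, a single request carries every acceptable location:
    public class MultiLocationResourceRequest {
      private final int memoryMb;          // stand-in for the resource vector
      private final int numContainers;
      private final List<String> nodes;    // nodes it would be happy with
      private final List<String> racks;    // racks it would be happy with
      private final boolean anyLocation;   // whether "*" is acceptable

      public MultiLocationResourceRequest(int memoryMb, int numContainers,
          List<String> nodes, List<String> racks, boolean anyLocation) {
        this.memoryMb = memoryMb;
        this.numContainers = numContainers;
        this.nodes = nodes;
        this.racks = racks;
        this.anyLocation = anyLocation;
      }
    }

When a container is allocated, the RM can decrement this one request directly, 
so the stale node-level entries that confuse delay scheduling never arise.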

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
