I recently filed a JIRA (YARN-371) that has sparked a fair bit of discussion. Arun asked me to move the discussion to yarn-dev. Here's a summary:
Each AMRM heartbeat consists of a list of resource requests. Currently, each resource request consists of a container count, a resource vector, and a location, which may be a node, a rack, or "*". When an application wishes to request a task run in multiple locations, it must issue a request for each location. This means that for a node-local task, it must issue three requests, one at the node-level, one at the rack-level, and one with * (any). These requests are not linked with each other, so when a container is allocated for one of them, the RM has no way of knowing which others to get rid of. When a node-local container is allocated, this is handled by decrementing the number of requests on that node's rack and in *. But when the scheduler allocates a task with a node-local request on its rack, the request on the node is left there. This can cause delay-scheduling to try to assign a container on a node that nobody cares about anymore. It also makes it impossible to support requests for only a single node (not its rack and * as well), as well as other weird kinds of needs, such as gang-scheduling, in which tasks need to be run at the same time, and scheduling for IO bound jobs where an application wants its tasks to run close to each other, but doesn't care where. My proposal is to modify the AMRM protocol to be "task-centric", i.e. to contain information about which requests to particular locations are linked to requests to other locations. This would allow future schedulers the ability to act on a task-centric view of an application's needs, and support a richer set of scheduling features. I understand that this is not the first time this discussion has been had, and I'm sure this design was considered during YARN's initial design stages. However, I believe that the current format may severely limit the types of applications that can be run on YARN, and as the system starts to become widely used, it is important to at least take another look at the protocol before APIs are locked down. Arun had concerns related to the overhead that would be induced by this change. I won't try to represent his views myself, but they are available on the JIRA. Thanks for reading, and let me know what you think! Sandy
