[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Tom White (JIRA)

[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570109#comment-13570109 ]

Tom White commented on YARN-371:


Looks like there's a misunderstanding here - Sandy talks about _reducing_ the 
memory requirements of the RM. If I understand the proposal correctly, the 
number of resource request objects sent by the AM in MR would be reduced from 
five (three node-local, one rack-local, one ANY) to one resource request with 
an array of locations (host names) of length five.
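
As a rough sketch of that reading (the class and field names below are 
hypothetical - nothing like this exists in today's API):

{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical shape of the consolidated request: one object carrying
// every acceptable location for a task, instead of one object per location.
class ConsolidatedResourceRequest {
  final int memoryMb;            // the resource vector (memory only, for brevity)
  final List<String> locations;  // hosts, racks, and possibly "*"

  ConsolidatedResourceRequest(int memoryMb, List<String> locations) {
    this.memoryMb = memoryMb;
    this.locations = locations;
  }
}

class Example {
  // Five locations, one request object, instead of five separate requests.
  ConsolidatedResourceRequest request = new ConsolidatedResourceRequest(
      1024, Arrays.asList("host1", "host2", "host3", "/rack1", "*"));
}
{code}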

BTW Arun, immediately vetoing an issue in the first comment is not conducive to 
a balanced discussion!

 Consolidate resource requests in AM-RM heartbeat
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request that a task run in multiple locations, it must issue a request for 
 each location.  This means that for a node-local task, it must issue three 
 requests: one at the node level, one at the rack level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task would now share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.
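
To make the current expansion concrete, here is a minimal sketch against the 
2.0.x-era records API (the host, rack, priority, and memory values are 
illustrative) of the three requests a single node-local task turns into today:

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.Records;

public class CurrentRequestModel {

  static ResourceRequest newRequest(String location, int containers) {
    ResourceRequest req = Records.newRecord(ResourceRequest.class);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(20);           // e.g. the MR map priority
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);         // 1 GB per container
    req.setPriority(priority);
    req.setCapability(capability);
    req.setHostName(location);          // a host, a rack, or "*"
    req.setNumContainers(containers);
    return req;
  }

  public static void main(String[] args) {
    // Three unlinked requests for one node-local task. Nothing ties them
    // together, which is the bookkeeping problem described above.
    ResourceRequest nodeLocal = newRequest("host1.example.com", 1);
    ResourceRequest rackLocal = newRequest("/rack1", 1);
    ResourceRequest any       = newRequest("*", 1);
  }
}
{code}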

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Arun C Murthy (JIRA)

[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570314#comment-13570314 ]

Arun C Murthy commented on YARN-371:


{quote}
Looks like there's a misunderstanding here - Sandy talks about reducing the 
memory requirements of the RM. If I understand the proposal correctly, the 
number of resource request objects sent by the AM in MR would be reduced from 
five (three node-local, one rack-local, one ANY) to one resource request with 
an array of locations (host names) of length five.
{quote}

Please read my explanation again. 

This change is *explicitly* against the design goals of the YARN 
ResourceManager and would increase the memory requirements of the RM by a 
couple of orders of magnitude.

Hadoop MR applications routinely have 100K+ tasks. The proposed change in 
this jira would require 100K+ resource-requests (one per task). Currently, in 
YARN, that can be expressed in O(nodes + racks + 1) resource-requests, which is 
~O(5000) on even the largest clusters known today. 

So, in effect, this change would be a significant regression and result in 
100,000 resource-requests v/s ~5000 needed today.
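
To put rough numbers on that comparison (the cluster figures are illustrative):

{noformat}
current model:  O(nodes + racks + 1)  ~  4500 + 300 + 1  ~  5,000 requests per application
proposed model: O(tasks)              ~  100,000 requests for a 100K-task job (~20x the RM state)
{noformat}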

bq. BTW Arun, immediately vetoing an issue in the first comment is not 
conducive to a balanced discussion!

Tom - You can read it as a veto, or you can read it as *I strongly disagree 
since this is against the goals of the project and a significant regression*. 
IAC, we should allow for people's communication style... and keep discussions 
technical - I'd appreciate that.



[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Arun C Murthy (JIRA)

[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570325#comment-13570325 ]

Arun C Murthy commented on YARN-371:


Sandy - please don't let this side discussion distract you; it's an individual 
style thing. I use -1 on grocery list discussions with my wife... 
unfortunately I don't have the luxury of vetoes in that context! *smile*

Anyway, there are other good discussion points, such as 'allowing requests on 
a specific node/rack', which I have pondered for a long while too. Maybe we 
can close this jira and open one for specific enhancements?



In the future, it would help if jira descriptions were short and proposed a 
specific enhancement - that way we can debate solutions separately (maybe even 
on the *-dev list). 

On the plus side, this way I can -1 a specific implementation proposal rather 
than the jira too... ;-)


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Robert Joseph Evans (JIRA)

[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570329#comment-13570329 ]

Robert Joseph Evans commented on YARN-371:

Tom, just like Arun said, the memory usage scales with the size of the 
cluster vs. the size of the request.  The current approach is on the order of 
the size of the cluster, whereas the proposed approach is on the order of the 
number of desired containers.  If I have a 100-node cluster and I am requesting 
10 map tasks, the size will be O(100 nodes + X racks + 1), possibly doubled if 
reducers are included. What is more, it is probably exactly the same size 
of request for 1 or even 1000 tasks, whereas the proposed approach would 
grow without bound as the number of tasks increased.

However, I also agree with Sandy that the current state compression is lossy 
and as such restricts what is possible in the scheduler. I would like to 
understand better what the size differences would be for various requests, both 
in memory and over the wire.  It seems conceivable to me that if the size 
difference is not too big, especially over the wire, we could allow the 
scheduler itself to decide on its in-memory representation.  This would allow 
the Capacity Scheduler to keep its current layout and allow others to 
experiment with more advanced scheduling options.  Different groups could 
decide which scheduler best fits their needs and workload.  If the size is 
significantly larger, I would like to see hard numbers about how much 
better/worse it makes specific use cases.
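
A very rough sketch of what letting each scheduler pick its representation 
could look like (the interface and method names are hypothetical):

{code:java}
import java.util.List;

// Hypothetical: the AM-RM wire protocol carries location-list requests,
// and each scheduler translates them into whatever internal form it prefers.
interface RequestRepresentation {
  // Absorb one request from an AM heartbeat.
  void addRequest(int memoryMb, int numContainers, List<String> locations);

  // Retire the matching bookkeeping once a container is placed on a node.
  void onContainerAllocated(String node);
}

// The Capacity Scheduler could implement this with today's compact
// node/rack/* aggregate tables, while an experimental scheduler could
// instead index requests per task for smarter, if heavier, decisions.
{code}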

I am also very concerned about adding too much complexity to the scheduler.  We 
have run into issues where the RM will get very far behind in scheduling 
because it is trying to do a lot already and eventually OOM as its event queue 
grows too large. 

I also don't want to change the scheduler protocol too much without first 
understanding how that new protocol would impact other potential scheduling 
features.  There are a number of other computing patterns that could benefit 
from specific scheduler support.  Things like gang scheduling where you need 
all of the containers at once or none of them can make any progress, or where 
you want all of the containers to be physically close to one another because 
they are very I/O intensive, but you don't really care where exactly they are.  
Or even something like HBase, where you essentially want one process on every 
single node with no duplicates.  Do the proposed changes make these use cases 
trivially simple, or do they require a lot of support in the AM to implement 
them?

  


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-01 Thread Arun C Murthy (JIRA)

[ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569277#comment-13569277 ]

Arun C Murthy commented on YARN-371:


{quote}
I would propose the tweak of allowing a single ResourceRequest to encapsulate 
all the location information for a task. So instead of just a single location, 
a ResourceRequest would contain an array of locations, including nodes that it 
would be happy with, racks that it would be happy with, and possibly *. Side 
effects of this change would be a reduction in the amount of data that needs to 
be transferred in a heartbeat, as well as in the RM's memory footprint, 
because what used to be different requests for the same task would now share 
some common data.
{quote}

-1

The main advantage of the current YARN model is that it is extremely compact 
in terms of the amount of state necessary, per application, on the 
ResourceManager for scheduling, and the amount of information passed around 
between the ApplicationMaster & ResourceManager. This is crucial for scaling 
the ResourceManager. The amount of information, per application, in this model 
is always O(cluster size), whereas in the current Hadoop Map-Reduce JobTracker 
it is O(number of tasks), which could run into hundreds of thousands of tasks. 
For large jobs it is sufficient to ask for containers only on racks and not 
specific hosts, since the ApplicationMaster can use them appropriately: each 
rack has many appropriate resources (i.e. input splits for MapReduce 
applications).

To be clear, we should avoid a *task*-specific view and stay with a 
'resource-specific' view.
