[jira] [Commented] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526473#comment-14526473 ]

Sunil G commented on YARN-1662:
-------------------------------

Yes [~jianhe], we can close this issue. After YARN-1769, we also have better reservation behavior. I checked this and it is not happening now.

Capacity Scheduler reservation issue cause Job Hang
---------------------------------------------------

Key: YARN-1662
URL: https://issues.apache.org/jira/browse/YARN-1662
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager, scheduler
Affects Versions: 2.2.0
Environment: Suse 11 SP1 + Linux
Reporter: Sunil G

There are 2 node managers in my cluster:
NM1 with 8GB
NM2 with 8GB

I am submitting a job with the details below:
AM with 2GB
Map needs 5GB
Reducer needs 3GB
slowstart is enabled with 0.5
10 maps and 50 reducers are assigned.

5 maps are completed. Now a few reducers got scheduled.
Now NM1 has the 2GB AM and 3GB Reducer_1 [Used 5GB]
NM2 has 3GB Reducer_2 [Used 3GB]

A Map has now reserved (5GB) on NM1, which has only 3GB free. It hangs forever.

The potential issue is that the reservation is now blocked on NM1 for a Map which needs 5GB, while Reducer_1 hangs waiting for a few map outputs. Reducer-side preemption also did not happen, as some headroom is still available.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
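The memory figures in the issue description can be checked with a short calculation. The sketch below is only an illustration of that arithmetic; the class and method names (ReservationScenarioSketch, freeGb, fits) are hypothetical and are not part of YARN, and it assumes the containers listed in the report are the only ones running.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that reproduces the memory arithmetic from the report. Not YARN code. */
class ReservationScenarioSketch {

    /** Free memory on a node: capacity minus the sum of the running containers. */
    static int freeGb(int capacityGb, int... usedGb) {
        int used = 0;
        for (int u : usedGb) {
            used += u;
        }
        return capacityGb - used;
    }

    /** A request fits only if the node has at least that much free memory. */
    static boolean fits(int requestGb, int freeGb) {
        return requestGb <= freeGb;
    }

    public static void main(String[] args) {
        int mapGb = 5; // map container size from the report

        Map<String, Integer> free = new LinkedHashMap<>();
        free.put("NM1", freeGb(8, 2, 3)); // 8GB node running the 2GB AM and 3GB Reducer_1
        free.put("NM2", freeGb(8, 3));    // 8GB node running 3GB Reducer_2

        for (Map.Entry<String, Integer> e : free.entrySet()) {
            System.out.println(e.getKey() + ": free=" + e.getValue()
                + "GB, " + mapGb + "GB map fits=" + fits(mapGb, e.getValue()));
        }
    }
}
```

With the figures as reported, NM1 has 3GB free, so the 5GB reservation held there cannot be satisfied while the AM and Reducer_1 keep running. The same figures leave 5GB free on NM2. The report does not say whether other containers were running on NM2, so this is only what the stated numbers imply.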
[jira] [Commented] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523998#comment-14523998 ]

Jian He commented on YARN-1662:
-------------------------------

Hi [~sunilg], YARN-1198 has fixed a number of headroom issues to make sure the headroom is correct, so that reducer preemption will kick in correctly. In that case, this problem may be resolved?
[jira] [Commented] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887471#comment-13887471 ]

Sunil G commented on YARN-1662:
-------------------------------

If we can implement timed reservation logic here, it will be safer for a fresh allocation to be tried on some other node. I have reviewed the scheduler part and found that this can be achieved without a separate timer thread.

addReReservation() is invoked when the same node tries to re-reserve the same application's request on that node. The counter is a multiset, so the internal count is incremented every time addReReservation() is called. It is also incremented only once per second (the node heartbeat interval).

I would like to add code like the following to the LeafQueue::assignContainer() method. If the limit is exceeded, I will try to unreserve the container from the node. This code is hit when the same application tries to re-reserve again on the same node.

    } else {
      // Reserve by 'charging' in advance...
      reserve(application, priority, node, rmContainer, container);

      // Check for re-reservation limit. In this case, unreserve and try for a
      // fresh allocation.
      if (RESERVATION_TIME_LIMIT != 0 &&
          application.getReReservations(priority) > RESERVATION_TIME_LIMIT) {
        unreserve(application, priority, node, rmContainer);
        return Resources.none();
      }

So on the next node update from some other node, CS can try to allocate resources to this application.

NB: Reservation ensures that a task can stick to the node where it is better to run. A larger configurable limit, based on the nature of the tasks running, can still achieve that behavior.

Please share your thoughts.
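The re-reservation limit proposed in this comment can be modeled outside the scheduler to see how it would behave over successive heartbeats. The following is a minimal stand-alone sketch, not YARN code: the class ReReservationLimitSketch, its fields, and onHeartbeat() are hypothetical, a plain map stands in for the multiset, and resetting the count on unreserve is an assumption made here for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone model of the proposed re-reservation limit. Not YARN code. */
class ReReservationLimitSketch {

    /** Limit on re-reservations; 0 disables the check, as in the proposal. */
    private final int limit;

    /** Re-reservation count per priority (stands in for the multiset). */
    private final Map<Integer, Integer> reReservations = new HashMap<>();

    private boolean reserved = false;

    ReReservationLimitSketch(int limit) {
        this.limit = limit;
    }

    int getReReservations(int priority) {
        return reReservations.getOrDefault(priority, 0);
    }

    boolean isReserved() {
        return reserved;
    }

    /**
     * Simulates one heartbeat from a node that cannot fit the container.
     * Returns true if the reservation is still held afterwards.
     */
    boolean onHeartbeat(int priority) {
        if (reserved) {
            // Same node reserving again for the same request: count a re-reservation.
            reReservations.merge(priority, 1, Integer::sum);
        }
        reserved = true;

        // Proposed check: give up the reservation once the limit is exceeded.
        if (limit != 0 && getReReservations(priority) > limit) {
            reserved = false;
            reReservations.remove(priority); // assumption: the count restarts after unreserve
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ReReservationLimitSketch sketch = new ReReservationLimitSketch(3);
        for (int hb = 1; hb <= 6; hb++) {
            System.out.println("heartbeat " + hb + " -> reserved=" + sketch.onHeartbeat(0));
        }
    }
}
```

Because the count only advances once per heartbeat, the limit acts as a rough timeout in heartbeat intervals, which is why no separate timer thread is needed in this model. A limit of 0 keeps the reservation indefinitely.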