[jira] [Updated] (MAPREDUCE-3184) Improve handling of fetch failures when a tasktracker is not responding on HTTP

2014-03-11 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated MAPREDUCE-3184:
--

Assignee: Todd Lipcon  (was: Jordan Zimmerman)

 Improve handling of fetch failures when a tasktracker is not responding on 
 HTTP
 ---

 Key: MAPREDUCE-3184
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Affects Versions: 0.20.205.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 1.0.1

 Attachments: mr-3184.txt


 On a 100-node cluster, we had an issue where one of the TaskTrackers was hit 
 by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed 
 was the following:
 - Every reducer would try to fetch the same map task, and fail after ~13 
 minutes.
 - At that point, all reducers would report this failed fetch to the JT for 
 the same task, and the task would be re-run.
 - Meanwhile, the reducers would move on to the next map task that ran on the 
 TT, and hang for another 13 minutes.
 The job essentially made no progress for hours, as each map task that ran on 
 the bad node was serially marked failed.
 To combat this issue, we should introduce a second type of failed fetch 
 notification, used when the TT does not respond at all (e.g., 
 SocketTimeoutException). These fetch failure notifications should count 
 against the TT at large, rather than a single task. If more than half of the 
 reducers report such an issue for a given TT, then all of the tasks from that 
 TT should be re-run.
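
The proposal above can be sketched as a small piece of JobTracker-side bookkeeping. This is only an illustration under stated assumptions, not the code in mr-3184.txt: the class name, the method name, and the use of plain strings and integers for tracker names and reducer IDs are all hypothetical. It counts distinct reducers that reported a tracker as unreachable and signals when more than half of them have done so.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of tracker-level fetch failure accounting. */
final class TrackerFetchFailureSketch {
    private final int totalReducers;

    // Tracker name -> IDs of the reducers that reported it as unresponsive.
    private final Map<String, Set<Integer>> reportersByTracker = new HashMap<>();

    TrackerFetchFailureSketch(int totalReducers) {
        if (totalReducers <= 0) {
            throw new IllegalArgumentException("totalReducers must be positive");
        }
        this.totalReducers = totalReducers;
    }

    /**
     * Records that a reducer could not reach the tracker at all (for example,
     * its fetch ended in a SocketTimeoutException). Repeated reports from the
     * same reducer are counted once.
     *
     * @return true once more than half of the reducers have reported the
     *         tracker, meaning all map output on it should be re-run
     */
    boolean reportTrackerUnresponsive(String tracker, int reducerId) {
        Set<Integer> reporters =
            reportersByTracker.computeIfAbsent(tracker, k -> new HashSet<>());
        reporters.add(reducerId);
        return reporters.size() * 2 > totalReducers;
    }

    public static void main(String[] args) {
        TrackerFetchFailureSketch sketch = new TrackerFetchFailureSketch(4);
        for (int reducer = 0; reducer < 4; reducer++) {
            boolean rerunAll = sketch.reportTrackerUnresponsive("tt-17", reducer);
            System.out.println("reducer " + reducer + " reported, re-run all: " + rerunAll);
        }
    }
}
```

Counting distinct reducers rather than raw reports keeps a single retrying reducer from tripping the threshold on its own.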



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-3184) Improve handling of fetch failures when a tasktracker is not responding on HTTP

2012-02-12 Thread Matt Foley (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated MAPREDUCE-3184:
--

Target Version/s: 1.0.1  (was: 1.1.0)
   Fix Version/s: (was: 1.1.0)
  1.0.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3184) Improve handling of fetch failures when a tasktracker is not responding on HTTP

2011-10-17 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated MAPREDUCE-3184:
---

Attachment: mr-3184.txt

Here is a patch implementing the approach described above.

It includes a unit test which shows that the new code can identify a spinning 
thread.

I also tested the new code by setting the abort threshold to 50% and pounding a 
tasktracker with an HTTP benchmark tool. This resulted in the TT aborting as 
expected when CPU usage of the selector thread crossed 50%.

If administrators find that this check produces false positives, the feature 
can be disabled entirely by setting mapred.tasktracker.jetty.cpu.check.enabled 
to false, or the threshold can be configured with 
mapred.tasktracker.jetty.cpu.threshold.fatal (default 90%).
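
For readers who want a feel for what a per-thread CPU threshold check involves, here is a minimal sketch. It is not the code in mr-3184.txt: the class and method names are hypothetical, and it assumes the check compares a thread's CPU time against wall-clock time over a sampling interval, using the standard java.lang.management.ThreadMXBean API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical sketch of a per-thread CPU usage check. */
final class ThreadCpuCheckSketch {

    /** Fraction of one core used: CPU time consumed divided by wall time elapsed. */
    static double cpuFraction(long cpuNanosDelta, long wallNanosDelta) {
        if (wallNanosDelta <= 0) {
            return 0.0;
        }
        return (double) cpuNanosDelta / wallNanosDelta;
    }

    /** True when the measured fraction is above the configured percentage. */
    static boolean exceedsThreshold(double fraction, int thresholdPercent) {
        return fraction * 100.0 > thresholdPercent;
    }

    /** Samples one thread's CPU usage over the given interval. */
    static double sample(long threadId, long intervalMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return 0.0; // Cannot measure on this JVM; treat as idle.
        }
        long cpuStart = mx.getThreadCpuTime(threadId);
        long wallStart = System.nanoTime();
        Thread.sleep(intervalMillis);
        long cpuEnd = mx.getThreadCpuTime(threadId);
        long wallEnd = System.nanoTime();
        if (cpuStart < 0 || cpuEnd < 0) {
            return 0.0; // getThreadCpuTime returns -1 if the thread is not alive.
        }
        return cpuFraction(cpuEnd - cpuStart, wallEnd - wallStart);
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample the current thread while it sleeps; it should be close to idle.
        double fraction = sample(Thread.currentThread().getId(), 200);
        System.out.println("over 90% threshold: " + exceedsThreshold(fraction, 90));
    }
}
```

A real check would likely require several consecutive samples above the threshold before aborting, so that a short burst of legitimate load is not mistaken for a spinning thread.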

 Improve handling of fetch failures when a tasktracker is not responding on 
 HTTP
 ---

 Key: MAPREDUCE-3184
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Affects Versions: 0.20.205.0
Reporter: Todd Lipcon
 Attachments: mr-3184.txt

