[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
[ https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-7022:
----------------------------------
        Resolution: Fixed
      Hadoop Flags: Reviewed
     Fix Version/s: 3.1.0
            Status: Resolved  (was: Patch Available)

Thanks, [~johang]! I committed this to trunk.

> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-7022
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>            Priority: Major
>             Fix For: 3.1.0
>
>         Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch,
> MAPREDUCE-7022.003.patch, MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch,
> MAPREDUCE-7022.006.patch, MAPREDUCE-7022.007.patch, MAPREDUCE-7022.008.patch,
> MAPREDUCE-7022.009.patch
>
>
> With the introduction of MAPREDUCE-6489, there are some options to kill rogue
> tasks based on writes to local disk. In our environment, where we mainly run
> Hive-based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300 GB and others where it was at 10 TB, both producing around 200 GB
> on local disk, so it didn't help us much. To extend this feature, tasks should
> monitor the local scratch dir size and fail if they pass the limit. In these
> cases the tasks should not be retried either; instead the job should fast fail.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
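The monitoring idea in the description can be illustrated with a minimal sketch: periodically sum the on-disk size of the task's local scratch directory and fast-fail once it exceeds a configured limit. The function names below are hypothetical illustrations only; they are not the API of the attached patches, which implement this inside the MapReduce task runtime.

```python
import os


def scratch_dir_size(path):
    """Recursively sum the size of all files under a local scratch dir."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A spill file may be deleted between listing and stat;
                # skip it rather than abort the check.
                pass
    return total


def should_fast_fail(path, limit_bytes):
    """True when the scratch dir has outgrown the configured limit.

    A real task would run this check on an interval and, on True,
    fail the attempt without scheduling a retry, failing the job fast.
    """
    return scratch_dir_size(path) > limit_bytes
```

Note the design point raised in the issue: this check measures actual scratch-dir usage rather than the BYTES_WRITTEN counter, which the reporter observed can diverge from on-disk size by orders of magnitude for Hive-style workloads.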
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.009.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.008.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.007.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.006.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.005.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.004.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.003.patch
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.002.patch

Should fix related tests and style issues
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Affects Version/s: 2.7.0
                       2.8.0
                       2.9.0
               Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size
Johan Gustavsson updated MAPREDUCE-7022:
----------------------------------------
    Attachment: MAPREDUCE-7022.001.patch