[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-07-07 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377054#comment-17377054
 ] 

Bilwa S T commented on MAPREDUCE-7353:
--

Thank you [~epayne]

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2
>
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
>
>
> The job fails because a task fails due to too many fetch failures.
> {code:java}
> Line 48048: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | 
> Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container 
> container_e03_1622107691213_1054_01_05 taskAttempt 
> attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:394
>   Line 48053: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | 
> KILLING attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:209
>   Line 58026: 2021-06-02 16:26:34,034 | INFO  | AsyncDispatcher event 
> handler | TaskAttempt killed because it ran on unusable node 
> node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_00_0 | 
> JobImpl.java:1401
>   Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | 
> TaskAttemptImpl.java:1390
>   Line 58035: 2021-06-02 16:26:34,034 | INFO  | RMCommunicator Allocator 
> | Killing taskAttempt:attempt_1622107691213_1054_m_00_0 because it is 
> running on unusable node:node-group-1ZYEq0002:26009 | 
> RMContainerAllocator.java:1066
>   Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | 
> TaskAttemptImpl.java:1390
>   Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type 
> TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
>   Line 58055: 2021-06-02 16:26:34,034 | INFO  | AsyncDispatcher event 
> handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: 
> Container released on a *lost* node | TaskAttemptImpl.java:2649
>   Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | 
> TaskAttemptImpl.java:1390
>   Line 60317: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event 
> handler | Too many fetch-failures for output of task attempt: 
> attempt_1622107691213_1054_m_00_0 ... raising fetch failure to map | 
> JobImpl.java:2005
>   Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type 
> TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
>   Line 60320: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event 
> handler | attempt_1622107691213_1054_m_00_0 transitioned from state 
> SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE 
> and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
>   Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type 
> TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
>   Line 69527: 2021-06-02 16:30:02,002 | INFO  | AsyncDispatcher event 
> handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: 
> cleanup failed for container container_e03_1622107691213_1054_01_05 : 
> java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to 
> node-group-1ZYEq0002:26009 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type 
> TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
>   Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event 
> handler | Processing attempt_1622107691213_1054_m_00_0 of type 
> TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
>   Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 
> going to fetch from node-group-1ZYEq0002:26008 for: 
> [attempt_1622107691213_1054_m_00_0] | Fetcher.java:318
>   Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL 
> for node-group-1ZYEq0002:26008 -> 
> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054&reduce=4&map=attempt_1622107691213_1054_m_00_0
>  | Fetcher.java:686
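
A note for readers following the trace quoted above: the excerpt is listed in file order rather than time order (the fetcher#9 entries at 16:23:56 appear last), so sorting it by timestamp may make the sequence easier to follow. The following is a minimal Python sketch, not part of the patch or of Hadoop itself. It assumes the pipe-delimited layout seen in the excerpt, with each wrapped entry rejoined onto one line; the names `parse_entry` and `TRANSITION` are hypothetical helpers introduced only for illustration.

```python
import re
from datetime import datetime

# Three entries copied from the quoted trace, rejoined onto single lines.
# Assumed layout (inferred from the excerpt, not from Hadoop documentation):
#   "Line N: <timestamp> | <LEVEL> | <thread> | <message> | <File.java:line>"
LOG = [
    "Line 48053: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | "
    "KILLING attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:209",
    "Line 60320: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event handler | "
    "attempt_1622107691213_1054_m_00_0 transitioned from state SUCCESS_CONTAINER_CLEANUP "
    "to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and "
    "nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411",
    "Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | "
    "Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: "
    "[attempt_1622107691213_1054_m_00_0] | Fetcher.java:318",
]

# Hypothetical pattern for the state-transition message seen in the trace.
TRANSITION = re.compile(
    r"transitioned from state (\w+) to (\w+), event type is (\w+)"
)


def parse_entry(line):
    """Split one rejoined entry into (timestamp, level, thread, message, source)."""
    head, level, thread, message, source = [p.strip() for p in line.split(" | ")]
    stamp = head.split(": ", 1)[1]  # drop the leading "Line N" marker
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return when, level, thread, message, source


# Order the entries by wall-clock time instead of position in the log file.
events = sorted((parse_entry(line) for line in LOG), key=lambda e: e[0])

transition = None
for when, level, thread, message, source in events:
    print(when.time(), thread, "|", message)
    match = TRANSITION.search(message)
    if match:
        transition = match.groups()  # (old state, new state, event type)
```

Sorted this way, the fetch attempt against port 26008 precedes the container cleanup, and the attempt later moves from SUCCESS_CONTAINER_CLEANUP to FAILED on TA_TOO_MANY_FETCH_FAILURE, which matches the failure described in the report.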

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-07-07 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376831#comment-17376831
 ] 

Eric Payne commented on MAPREDUCE-7353:
---

+1. Will commit now.
Thanks very much, [~BilwaST].

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-07-07 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376772#comment-17376772
 ] 

Eric Payne commented on MAPREDUCE-7353:
---

Thanks a lot, [~BilwaST], for the patch update. The code and UT LGTM. I want to 
run a few more tests in my environment, but once I've done that I'll commit.

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-30 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371932#comment-17371932
 ] 

Bilwa S T commented on MAPREDUCE-7353:
--

Hi [~epayne], can you please check the updated patch? Thanks.

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368873#comment-17368873
 ] 

Hadoop QA commented on MAPREDUCE-7353:
--

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 14m 30s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | | 0m 0s | test4tests | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests || ||
| +1 | mvninstall | 30m 42s | | trunk passed |
| +1 | compile | 0m 39s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 36s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 37s | | trunk passed |
| +1 | mvnsite | 0m 40s | | trunk passed |
| +1 | shadedclient | 14m 24s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 34s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 29s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 16m 30s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 4s | | trunk passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 0m 31s | | the patch passed |
| +1 | compile | 0m 29s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javac | 0m 29s | | the patch passed |
| +1 | compile | 0m 27s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | javac | 0m 27s | | the patch passed |
| +1 | checkstyle | 0m 28s | | the patch passed |
| +1 | mvnsite | 0m 30s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 37s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 28s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 27s | | the

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-24 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368812#comment-17368812
 ] 

Bilwa S T commented on MAPREDUCE-7353:
--

Thanks, [~epayne], for your review. I have added a UT. Please take a look.

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368405#comment-17368405
 ] 

Eric Payne commented on MAPREDUCE-7353:
---

[~BilwaST], the changes LGTM. Would it be possible to add unit tests, perhaps 
to {{TestTaskAttempt}}?

> Mapreduce job fails when NM is stopped
> --
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-23 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368087#comment-17368087
 ] 

Bilwa S T commented on MAPREDUCE-7353:
--

Hi [~epayne]
Can you please take a look at this today if possible? Thanks


[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-17 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364993#comment-17364993
 ] 

Bilwa S T commented on MAPREDUCE-7353:
--

Ok Thanks [~epayne]


[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-17 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364967#comment-17364967
 ] 

Eric Payne commented on MAPREDUCE-7353:
---

[~BilwaST], thanks for raising this. I have encountered a similar situation. 
This would be important to fix. I will try to look at this early next week. I 
appreciate your patience.


[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364279#comment-17364279
 ] 

Hadoop QA commented on MAPREDUCE-7353:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 14m 10s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 45s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 39s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 35s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 37s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 16s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 32s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 31s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 16m 17s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  0m 58s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 32s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 30s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 30s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 27s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 27s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 27s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 30s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 45s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 26s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| 

[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

2021-06-16 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364233#comment-17364233
 ] 

Bilwa S T commented on MAPREDUCE-7353:
--

cc [~epayne] [~jbrennan]
