[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364279#comment-17364279 ]

Hadoop QA commented on MAPREDUCE-7353:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 14m 10s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 45s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 16s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 16m 17s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 0m 58s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 45s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
[jira] [Updated] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bilwa S T updated MAPREDUCE-7353:
    Status: Patch Available  (was: Open)

> Mapreduce job fails when NM is stopped
>
> Key: MAPREDUCE-7353
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-7353.001.patch
>
> The job fails because a task fails due to too many fetch failures.
> {code:java}
> Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_05 taskAttempt attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:394
> Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:209
> Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_00_0 | JobImpl.java:1401
> Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_00_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
> Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
> Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_00_0 ... raising fetch failure to map | JobImpl.java:2005
> Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
> Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_00_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
> Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: cleanup failed for container container_e03_1622107691213_1054_01_05 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_00_0] | Fetcher.java:318
> Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054=4=attempt_1622107691213_1054_m_00_0 | Fetcher.java:686
> Line 74093: 2021-06-02 16:26:56,056 | INFO | fetcher#9 | Reporting fetch failure for
[jira] [Updated] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bilwa S T updated MAPREDUCE-7353:
    Attachment: MAPREDUCE-7353.001.patch
[jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
[ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364233#comment-17364233 ]

Bilwa S T commented on MAPREDUCE-7353:

cc [~epayne] [~jbrennan]
[jira] [Created] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped
Bilwa S T created MAPREDUCE-7353:

    Summary: Mapreduce job fails when NM is stopped
    Key: MAPREDUCE-7353
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Reporter: Bilwa S T
    Assignee: Bilwa S T

The job fails because a task fails due to too many fetch failures.
{code:java}
Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_05 taskAttempt attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:394
Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_00_0 | ContainerLauncherImpl.java:209
Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_00_0 | JobImpl.java:1401
Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_00_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_00_0 ... raising fetch failure to map | JobImpl.java:2005
Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_00_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_00_0: cleanup failed for container container_e03_1622107691213_1054_01_05 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_00_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_00_0] | Fetcher.java:318
Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054=4=attempt_1622107691213_1054_m_00_0 | Fetcher.java:686
Line 74093: 2021-06-02 16:26:56,056 | INFO | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_00_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
{code}
The logs show that the RM notified the AM of the node update at 16:26:34, but the resulting TA_KILL event was skipped because KILL is ignored while TaskAttemptImpl is in the SUCCESS_CONTAINER_CLEANUP state. The next event the attempt receives is TA_TOO_MANY_FETCH_FAILURE, which moves it from SUCCESS_CONTAINER_CLEANUP to FAILED and causes the task to fail.
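To make the event ordering easier to follow, here is a minimal sketch of the sequence seen in the logs. It is not the real TaskAttemptImpl state machine: AttemptSketch, AttemptState and AttemptEvent are hypothetical names used only for illustration, and the transition to SUCCEEDED on TA_CONTAINER_CLEANED is an assumption that the logs in this issue do not confirm. Only the two behaviours reported above are taken from the logs: TA_KILL is ignored in SUCCESS_CONTAINER_CLEANUP, and TA_TOO_MANY_FETCH_FAILURE moves the attempt to FAILED.

```java
// Hypothetical, simplified model of the sequence in the logs above.
// AttemptSketch and its enums are illustrative names, not Hadoop classes.

enum AttemptState { SUCCESS_CONTAINER_CLEANUP, SUCCEEDED, FAILED }

enum AttemptEvent { TA_KILL, TA_CONTAINER_CLEANED, TA_TOO_MANY_FETCH_FAILURE }

final class AttemptSketch {
    // The map attempt has finished its work and is waiting for container cleanup.
    private AttemptState state = AttemptState.SUCCESS_CONTAINER_CLEANUP;

    AttemptState getState() {
        return state;
    }

    void handle(AttemptEvent event) {
        if (state != AttemptState.SUCCESS_CONTAINER_CLEANUP) {
            return; // later states are not modelled in this sketch
        }
        switch (event) {
            case TA_KILL:
                // Reported behaviour: the kill triggered by the node update is ignored here.
                break;
            case TA_CONTAINER_CLEANED:
                // Assumption: normal cleanup completes the attempt successfully.
                state = AttemptState.SUCCEEDED;
                break;
            case TA_TOO_MANY_FETCH_FAILURE:
                // Reported behaviour (log line 60320): the attempt goes straight to FAILED.
                state = AttemptState.FAILED;
                break;
        }
    }
}
```

Feeding the sketch TA_KILL followed by TA_TOO_MANY_FETCH_FAILURE leaves it in FAILED, which matches the order of events at 16:26:34 and 16:26:57 in the AM log.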