Prabhu Joseph created YARN-11494:
------------------------------------

             Summary: Acquired Containers are killed when the node is reconnected
                 Key: YARN-11494
                 URL: https://issues.apache.org/jira/browse/YARN-11494
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.3.3
            Reporter: Prabhu Joseph
            Assignee: Prabhu Joseph

When a NodeManager reconnects, the ResourceManager marks the acquired containers on that node as LOST, which leads to job failure.

{code}
2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 41 on 8025): Reconnect from the node at: node1
2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 41 on 8025): NodeManager from node node1(cmPort: 8041 httpPort: 8042) registered with capability: <memory:122880, vCores:16>, assigned nodeId node1:8041, node labels { CORE }
2023-04-10 02:57:16,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_e15_1677844874019_238016_01_000002 Container Transitioned from ACQUIRED to KILLED
{code}
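
To check whether other containers were affected in the same way, a ResourceManager log can be scanned for this pattern. The following is only a rough sketch, not part of any Hadoop tooling: the function name `acquired_killed_after_reconnect` is hypothetical, the regular expressions assume the exact message wording quoted in this report, and attributing a killed container to the most recent reconnect message is a heuristic, since the transition line itself does not name the node.

```python
import re

# Assumption: message wording matches the log excerpt in this report;
# other Hadoop versions or log layouts may differ.
RECONNECT = re.compile(r"Reconnect from the node at: (\S+)")
TRANSITION = re.compile(
    r"(container_\S+) Container Transitioned from ACQUIRED to KILLED"
)


def acquired_killed_after_reconnect(lines):
    """Return (node, container_id) pairs for ACQUIRED -> KILLED transitions
    that appear after a reconnect message.

    Heuristic: each transition is attributed to the most recent reconnect
    seen in the log, which may be wrong if several nodes reconnect at once.
    """
    hits = []
    last_node = None
    for line in lines:
        match = RECONNECT.search(line)
        if match:
            last_node = match.group(1)
            continue
        match = TRANSITION.search(line)
        if match and last_node is not None:
            hits.append((last_node, match.group(1)))
    return hits


# Log lines taken from this report, with the timestamp and thread prefix kept.
SAMPLE = [
    "2023-04-10 02:57:16,412 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "ResourceTrackerService (IPC Server handler 41 on 8025): "
    "Reconnect from the node at: node1",
    "2023-04-10 02:57:16,413 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "rmcontainer.RMContainerImpl (ResourceManager Event Processor): "
    "container_e15_1677844874019_238016_01_000002 "
    "Container Transitioned from ACQUIRED to KILLED",
]

if __name__ == "__main__":
    for node, container in acquired_killed_after_reconnect(SAMPLE):
        print(node, container)
```

In practice the `lines` argument could be an open log file object, since the function only iterates over it line by line.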