[
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Song Jiacheng updated YARN-10791:
---------------------------------
Description:
We are performing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this
exception while upgrading the NMs.
When we exclude a node and run refreshNodes gracefully, all the MR AMs fail.
2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
    at java.lang.Thread.run(Thread.java:745)
The cause is that we gracefully decommission nodes while still using 2.6 MR.
handleUpdatedNodes in 2.6 MR does not recognize the node state "DECOMMISSIONING".
So I added a config that decides whether the RM sends DECOMMISSIONING to AMs.
I don't know whether this needs to be fixed; I am just proposing a solution for
this situation.
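To make the failure mode concrete, here is a minimal, self-contained sketch of how a state name that an older reader does not know can surface as null and then trigger an NPE. It is a simplified model, not the Hadoop code: the class UnknownNodeStateDemo, the enum OldNodeState, and the decode helper are hypothetical names introduced only for illustration, and the assumption is that the unrecognized state reaches the 2.6 allocator as null.

```java
public class UnknownNodeStateDemo {
    // Hypothetical model of the node states an older (2.6-style) reader knows.
    public enum OldNodeState {
        NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED;

        public boolean isUnusable() {
            return this == UNHEALTHY || this == DECOMMISSIONED || this == LOST;
        }
    }

    // Assumption: a state the old reader does not know is surfaced as null.
    public static OldNodeState decode(String wireName) {
        try {
            return OldNodeState.valueOf(wireName);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Mirrors an unguarded call on the decoded state: throws NPE for unknown states.
    public static boolean unguardedIsUnusable(String wireName) {
        return decode(wireName).isUnusable();
    }

    // Null-safe variant: an unknown state is simply treated as "not unusable".
    public static boolean guardedIsUnusable(String wireName) {
        OldNodeState state = decode(wireName);
        return state != null && state.isUnusable();
    }

    public static void main(String[] args) {
        System.out.println(guardedIsUnusable("LOST"));
        System.out.println(guardedIsUnusable("DECOMMISSIONING"));
        try {
            unguardedIsUnusable("DECOMMISSIONING");
        } catch (NullPointerException e) {
            System.out.println("NPE on unknown state");
        }
    }
}
```

A null check like the guarded variant would have to live in the old AM, which cannot be patched mid-upgrade; that is why the proposal here is to control what the RM sends instead.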
!image-2021-05-31-10-32-17-541.png!
There are 2 nodes in the cluster, and the AM is deployed on node 44. I excluded
node 46, the other node in the cluster, and then ran refreshNodes; the error
above occurred.
As I said, I think the root cause is the compatibility of NodeStateProto.
!image-2021-05-31-10-37-31-795.png!
2.6 MR cannot recognize DECOMMISSIONING or SHUTDOWN.
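As a rough illustration of the config-based workaround, the sketch below filters out the two states that 2.6 MR does not know before updates are handed to AMs. It is not the attached patch: the class NodeStateFilterSketch, the sendNewStates flag, and the statesToSend method are hypothetical names, and the real change would operate on node reports inside the RM rather than on bare enum values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class NodeStateFilterSketch {
    // Simplified stand-in for the node states in the newer release.
    public enum NodeState {
        NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED,
        DECOMMISSIONING, SHUTDOWN
    }

    // The two states that 2.6 MR cannot recognize, per the description above.
    public static final Set<NodeState> UNKNOWN_TO_OLD_AMS =
            EnumSet.of(NodeState.DECOMMISSIONING, NodeState.SHUTDOWN);

    // sendNewStates is a hypothetical config flag: when false, updates in
    // states unknown to old AMs are dropped instead of being sent.
    public static List<NodeState> statesToSend(List<NodeState> updated,
                                               boolean sendNewStates) {
        List<NodeState> out = new ArrayList<>();
        for (NodeState state : updated) {
            if (sendNewStates || !UNKNOWN_TO_OLD_AMS.contains(state)) {
                out.add(state);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<NodeState> updated = Arrays.asList(
                NodeState.RUNNING, NodeState.DECOMMISSIONING, NodeState.LOST);
        System.out.println(statesToSend(updated, false));
        System.out.println(statesToSend(updated, true));
    }
}
```

With the flag off during the rolling upgrade, old AMs only ever see states they can decode; once every AM runs the new version, the flag can be turned on again.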
> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2
> ----------------------------------------------------------------------
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
> Issue Type: Bug
> Components: RM
> Affects Versions: 3.2.1
> Reporter: Song Jiacheng
> Priority: Minor
> Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png,
> image-2021-05-31-10-37-31-795.png