Sunil Kalva created YARN-11449:
----------------------------------
Summary: Fix Intermittent NPE while getting node labels for queue
Key: YARN-11449
URL: https://issues.apache.org/jira/browse/YARN-11449
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Reporter: Sunil Kalva
Assignee: Sunil Kalva
A NullPointerException (NPE) is thrown in the YARN client when checking queue status.
Partial stack trace:
Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
    at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getNodeLabelsForQueue(AbstractCSQueue.java:961)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getQueueConfigurations(AbstractCSQueue.java:528)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getQueueInfo(AbstractCSQueue.java:494)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:1048)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.getResourceUsageReport(FiCaSchedulerApp.java:1041)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:408)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:142)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:954)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:761)
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:396)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:224)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:529)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
The issue originates at
[https://git.corp.linkedin.com:1367/plugins/gitiles/hadoop/hadoop/+/li-2.10.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#960]
because there is no null check on the result of this.getAccessibleNodeLabels(),
which can be null because of
[https://git.corp.linkedin.com:1367/plugins/gitiles/hadoop/hadoop/+/li-2.10.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#598]
when CapacitySchedulerConfiguration is read and the following property is not
set:
yarn.scheduler.capacity.<queuename>.accessible-node-labels
In such cases, the RM usually falls back to the same property on the queue's
parent. In the cases above, however, the parent's property was also unset.
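
A minimal sketch of the kind of null guard that would avoid this NPE. It is not the actual AbstractCSQueue code: the class NodeLabelNullCheckSketch and the methods unguardedCollect and collectLabels are hypothetical names used only for illustration, and treating a null label set as "no labels" is an assumption about the desired behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the label-collection step; not Hadoop code.
final class NodeLabelNullCheckSketch {

  private NodeLabelNullCheckSketch() {
  }

  // Mirrors the failing pattern: addAll(null) throws NullPointerException
  // from AbstractCollection.addAll, as seen at the top of the stack trace.
  static Set<String> unguardedCollect(Set<String> accessibleNodeLabels) {
    Set<String> nodeLabels = new HashSet<>();
    nodeLabels.addAll(accessibleNodeLabels);
    return nodeLabels;
  }

  // Guarded variant: a null label set (property unset on the queue and on
  // its parent) is treated as an empty set. This choice is an assumption.
  static Set<String> collectLabels(Set<String> accessibleNodeLabels) {
    Set<String> nodeLabels = new HashSet<>();
    if (accessibleNodeLabels != null) {
      nodeLabels.addAll(accessibleNodeLabels);
    }
    return nodeLabels;
  }
}
```

With this guard, a queue whose accessible-node-labels property is unset yields an empty label set instead of failing the getApplicationReport call. Whether an empty set or some default partition is the right fallback is a decision for the actual patch.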