Aleksandr Balitsky created YARN-5691:
----------------------------------------
Summary: RM failed to load/recover state due to bad
DelegationKey in RM State Store
Key: YARN-5691
URL: https://issues.apache.org/jira/browse/YARN-5691
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.3, 2.7.2, 2.7.1, 2.7.0
Reporter: Aleksandr Balitsky
Priority: Minor
RM failed during recovery with the following error:
2016-09-12 21:32:21,999 ERROR
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to
load/recover state
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at
org.apache.hadoop.security.token.delegation.DelegationKey.readFields(DelegationKey.java:110)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadRMDTSecretManagerState(FileSystemRMStateStore.java:346)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadState(FileSystemRMStateStore.java:199)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1044)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1084)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
2016-09-12 21:32:22,002 INFO org.apache.hadoop.service.AbstractService: Service
RMActiveServices failed in state STARTED; cause: java.io.EOFException
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at
org.apache.hadoop.security.token.delegation.DelegationKey.readFields(DelegationKey.java:110)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadRMDTSecretManagerState(FileSystemRMStateStore.java:346)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadState(FileSystemRMStateStore.java:199)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1044)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1084)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
2016-09-12 21:32:22,008 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping ResourceManager metrics system...
2016-09-12 21:32:22,009 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
ResourceManager metrics system stopped.
2016-09-12 21:32:22,009 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
ResourceManager metrics system shutdown complete.
2016-09-12 21:32:22,010 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
AsyncDispatcher is draining to stop, igonring any new events.
2016-09-12 21:32:22,012 INFO org.apache.hadoop.service.AbstractService: Service
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state
STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:250)
at
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:256)
at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at
org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at
org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:614)
at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1007)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1048)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
This happens because the DelegationKey_45 file has size 0. You can easily
reproduce it by placing an empty file with this name under the
/var/user/cluster/yarn/rm/system/FSRMStateRoot/RMDTSecretManagerRoot/ directory
in HDFS and then restarting the RM.
The solution is to add a check for an empty stream of DelegationKey data, so
that the RM does not fail during startup.
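A minimal sketch of such a check, in plain Java with no Hadoop dependencies.
SimpleKey, readUnguarded and readGuarded are hypothetical names used only for
illustration; they are not the DelegationKey or FileSystemRMStateStore code,
and the fixed-width encoding here stands in for the real vint-based format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Optional;

final class EmptyKeyGuard {
    /** Hypothetical stand-in for a stored key: an id and an expiry date. */
    static final class SimpleKey {
        final int keyId;
        final long expiryDate;

        SimpleKey(int keyId, long expiryDate) {
            this.keyId = keyId;
            this.expiryDate = expiryDate;
        }
    }

    /** Serializes a key the same way readUnguarded expects to read it. */
    static byte[] serialize(SimpleKey key) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(key.keyId);
            out.writeLong(key.expiryDate);
        }
        return bytes.toByteArray();
    }

    /** Reads without any check: an empty input ends in EOFException. */
    static SimpleKey readUnguarded(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new SimpleKey(in.readInt(), in.readLong());
        }
    }

    /** Skips empty input instead of failing, so the caller can log and continue. */
    static Optional<SimpleKey> readGuarded(byte[] data) throws IOException {
        if (data == null || data.length == 0) {
            return Optional.empty();
        }
        return Optional.of(readUnguarded(data));
    }

    public static void main(String[] args) throws IOException {
        byte[] empty = new byte[0];
        try {
            readUnguarded(empty);
        } catch (EOFException e) {
            System.out.println("unguarded read failed on empty data");
        }
        System.out.println("guarded read present: " + readGuarded(empty).isPresent());
        byte[] good = serialize(new SimpleKey(45, 1473715941999L));
        System.out.println("guarded read key id: " + readGuarded(good).get().keyId);
    }
}
```

Whether a skipped key should be only logged or also removed from the store is
a separate decision that this sketch does not address.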
Additionally, the "storeRMDTMasterKeyState" method in ZKRMStateStore.java
stores the DelegationKey file (the file was broken (empty) in our case). This
method can leave the DelegationKey file empty if the write method of
DataOutputStream fails. A JIRA that prevents a possible resource leak in this
method has already been fixed: https://issues.apache.org/jira/browse/YARN-5663
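One general way to avoid leaving an empty file behind after a failed write is
to write to a temporary file and rename it only once the write has succeeded.
The sketch below shows this on a local filesystem with java.nio.file. It is
not the YARN-5663 change and not the state store's code; AtomicKeyWriter and
writeAtomically are hypothetical names, and the ".tmp" suffix is an assumption:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicKeyWriter {
    /**
     * Writes payload to a sibling temporary file, then moves it into place.
     * If the write fails, the temporary file is removed and the target
     * path is never created as an empty or partial file.
     */
    static void writeAtomically(Path target, byte[] payload) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
            out.write(payload);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
        // With ATOMIC_MOVE, replacing an existing target is implementation specific.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With this pattern a reader sees either no file or a complete one, which
complements the empty-stream check on the read side.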