[ https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864125#comment-16864125 ]
Bibin A Chundatt commented on YARN-9627:
----------------------------------------

Looked into the logs on the cluster side. My understanding is as follows:
# The pending applications leave the {{renewerService}} queue with a very large number of {{DelegationTokenRenewerAppRecoverEvent}}s to process.
# Due to a ZK connection issue, the RM transitions from active to standby.
# The order of operations in DelegationTokenRenewer#serviceStop seems wrong:
{code}
appTokens.clear();
allTokens.clear();
{code}
# The clear is done before the executor is shut down, so events that are still being processed hit an NPE at {{appTokens.get(applicationId)}} because {{appTokens}} has already been cleared:
{code}
if (currentDtr != null) {
  // another job beat us
  currentDtr.referringAppIds.add(applicationId);
  appTokens.get(applicationId).add(currentDtr);
} else {
  appTokens.get(applicationId).add(dtr);
  setTimerForTokenRenewal(dtr);
}
{code}
# The NPE leaves the thread pool in an invalid state.

*Solution*
# Correct the order of shutdown in {{DelegationTokenRenewer}}.
# Use {{HadoopThreadPoolExecutor}} instead of {{ThreadPoolExecutor}}.
# Ignore the event if {{isServiceStarted}} is false.

Looking at the code, I think {{DelegationTokenRenewer}} could block the RM's transition to standby. Thoughts?
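
To make the first and third proposals concrete, here is a minimal standalone Java sketch. It is not the Hadoop code: {{RenewerSketch}} and all of its fields and methods are hypothetical stand-ins for {{DelegationTokenRenewer}}, and a plain JDK thread pool is used in place of {{HadoopThreadPoolExecutor}}. It only illustrates the ordering (stop the pool, wait for it to drain, then clear the maps) and the "ignore the event when the service is not started" guard.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-in for DelegationTokenRenewer (not the Hadoop class). */
class RenewerSketch {
  // Assumed shape of the per-application token map; the real class differs.
  private final ConcurrentHashMap<String, Set<String>> appTokens = new ConcurrentHashMap<>();
  private final ExecutorService renewerService = Executors.newFixedThreadPool(2);
  private final AtomicInteger ignoredEvents = new AtomicInteger();
  private volatile boolean serviceStarted = false;

  void start() {
    serviceStarted = true;
  }

  /** Queues a recover event for one application. */
  void submitRecoverEvent(String appId, String token) {
    try {
      renewerService.execute(() -> handleEvent(appId, token));
    } catch (RejectedExecutionException e) {
      // The pool has already been shut down; count the event as ignored.
      ignoredEvents.incrementAndGet();
    }
  }

  private void handleEvent(String appId, String token) {
    // Proposal 3: ignore events once the service is no longer started.
    if (!serviceStarted) {
      ignoredEvents.incrementAndGet();
      return;
    }
    // computeIfAbsent never yields null, unlike a bare get() on a cleared map.
    appTokens.computeIfAbsent(appId, k -> ConcurrentHashMap.newKeySet()).add(token);
  }

  /** Proposal 1: stop accepting work, drain the pool, and only then clear state. */
  void stop() {
    serviceStarted = false;
    renewerService.shutdown();
    try {
      if (!renewerService.awaitTermination(10, TimeUnit.SECONDS)) {
        renewerService.shutdownNow();
      }
    } catch (InterruptedException e) {
      renewerService.shutdownNow();
      Thread.currentThread().interrupt();
    }
    // No worker can still be reading the map at this point.
    appTokens.clear();
  }

  int trackedApps() {
    return appTokens.size();
  }

  int ignoredEvents() {
    return ignoredEvents.get();
  }

  public static void main(String[] args) {
    RenewerSketch renewer = new RenewerSketch();
    renewer.start();
    for (int i = 0; i < 1000; i++) {
      renewer.submitRecoverEvent("application_" + i, "token_" + i);
    }
    renewer.stop();
    System.out.println("tracked apps after stop: " + renewer.trackedApps());
  }
}
```

Because queued events return immediately once the started flag is cleared, a large backlog drains quickly during {{stop()}}, which may also help with the concern that the renewer could delay the transition to standby.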
> DelegationTokenRenewer throws exception on switchover
> -----------------------------------------------------
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: krishna reddy
> Priority: Major
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) and performing an RM switchover
> Exception below:
> {noformat}
> 2019-06-13 17:39:27,594 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: Renew Kind: HDFS_DELEGATION_TOKEN, Service: XXXXXXXXX:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, realUser=, issueDate=1560361265181, maxDate=1560966065181, sequenceNumber=104708, masterKeyId=3);exp=1560533965360; apps=[application_1560346941775_20702] in 86397766 ms, appId = [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error occurred for the packet 'clientPath:null serverPath:null finished:false header:: 27,4 replyHeader:: 27,4295687588,0 request:: '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F response:: #31ffffff8a16b74ffffffe129768ffffffdbffffffe949ffffff8dffffffd517ffffffcafffffffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577} '.
> 2019-06-13 17:58:20,877 INFO org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: XXXXXXXXX:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, realUser=, issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
>   at java.util.Timer.sched(Timer.java:397)
>   at java.util.Timer.schedule(Timer.java:208)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
>   at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}