[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021527#comment-18021527 ]
ASF GitHub Bot commented on YARN-11719:
---------------------------------------

github-actions[bot] commented on PR #7077:
URL: https://github.com/apache/hadoop/pull/7077#issuecomment-3314278577

   We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again.
   Thanks all for your contribution.


> The job is stuck in the new state.
> ----------------------------------
>
> Key: YARN-11719
> URL: https://issues.apache.org/jira/browse/YARN-11719
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: zeekling
> Priority: Major
> Labels: pull-request-available
>
> After I restarted the Router in the production environment, several jobs
> remained in the new state. I found the following related log:
>
> {code:java}
> 2024-08-30 00:12:41,380 | WARN | DelegationTokenRenewer #667 | Unable to add the application to the delegation token renewer. | DelegationTokenRenewer.java:1215
> java.io.IOException: Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nsfed, Ident: (token for admintest: HDFS_DELEGATION_TOKEN owner=admintest@9FCE074E_691F_480F_98F5_58C1CA310829.COM, renewer=mapred, realUser=, issueDate=1724947875776, maxDate=1725552675776, sequenceNumber=156, masterKeyId=116)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:641)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$2200(DelegationTokenRenewer.java:86)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:1211)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:1188)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.InterruptedIOException: Retry interrupted
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:141)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:112)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
>     at com.sun.proxy.$Proxy96.renewDelegationToken(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient$Renewer.renew(DFSClient.java:849)
>     at org.apache.hadoop.security.token.Token.renew(Token.java:498)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:771)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:768)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1890)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:767)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:627)
>     ... 8 more
> Caused by: java.lang.InterruptedException: sleep interrupted
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:135)
>     ... 20 more
> 2024-08-30 00:12:41,380 | WARN | DelegationTokenRenewer #667 | AsyncDispatcher thread interrupted | AsyncDispatcher.java:437
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1233)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:434)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:1221)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:1188)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> 2024-08-30 00:12:41,381 | WARN | DelegationTokenRenewer #667 | Caught exception in thread DelegationTokenRenewer #667: | ExecutorHelper.java:63
> java.util.concurrent.CancellationException
>     at java.util.concurrent.FutureTask.report(FutureTask.java:121)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at org.apache.hadoop.util.concurrent.ExecutorHelper.logThrowableFromAfterExecute(ExecutorHelper.java:48)
>     at org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor.afterExecute(HadoopThreadPoolExecutor.java:90)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
> {code}
>
> params:
> yarn.resourcemanager.delegation-token-renewer.thread-timeout=80S
> dfs.client.socket-timeout=60S
>
> When the Router was restarted, the RM was in the middle of renewing the
> token. The renewal call is retried several times, sleeping between
> retries. Once the renewal has run for more than 80 seconds, the token
> renewal thread is interrupted by the following code:
>
> {code:java}
> DelegationTokenRenewerEvent evt = dtrf.getEvt();
> Future<?> future = dtrf.getFuture();
> try {
>   future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
> } catch (TimeoutException e) {
>   // Cancel thread and retry the same event in case of timeout.
>   if (!future.isDone() && !future.isCancelled()) {
>     future.cancel(true);
>     if (evt.getAttempt() < tokenRenewerThreadRetryMaxAttempts) {
>       renewalTimer.schedule(
>           getTimerTask((AbstractDelegationTokenRenewerAppEvent) evt),
>           tokenRenewerThreadRetryInterval);
>     } else {
>       LOG.info(
>           "Exhausted max retry attempts {} in token renewer "
>               + "thread for {}",
>           tokenRenewerThreadRetryMaxAttempts, evt.getApplicationId());
>     }
>   }
> } catch (Exception e) {
>   LOG.info("Problem in submitting renew tasks in token renewer "
>       + "thread.", e);
> }
> {code}
>
> The interrupt is caught by the following code, which sets the thread's
> interrupt flag again and then throws an exception. The token renewal
> therefore fails, and the state machine of the job needs to change from
> new to rejected.
>
> {code:java}
> try {
>   Thread.sleep(retryInfo.delay);
> } catch (InterruptedException e) {
>   Thread.currentThread().interrupt();
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Interrupted while waiting to retry", e);
>   }
>   InterruptedIOException intIOE = new InterruptedIOException(
>       "Retry interrupted");
>   intIOE.initCause(e);
>   throw intIOE;
> }
> {code}
>
> However, because the interrupt flag has been set again, it is detected by
> the following code in AsyncDispatcher.java when the event is queued, so
> the state transition fails and the job stays in the new state.
>
> {code:java}
> try {
>   eventQueue.put(event);
> } catch (InterruptedException e) {
>   if (!stopped) {
>     LOG.warn("AsyncDispatcher thread interrupted", e);
>   }
>   // Need to reset drained flag to true if event queue is empty,
>   // otherwise dispatcher will hang on stop.
>   drained = eventQueue.isEmpty();
>   throw new YarnRuntimeException(e);
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
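
The interrupt-flag behaviour described in the report can be reproduced in isolation with plain JDK classes. The following is a minimal sketch and is not taken from Hadoop or from PR #7077: the class name InterruptFlagDemo, its methods, and the "APP_REJECTED" string are hypothetical stand-ins. It only shows that a thread which restores its interrupt flag after an interrupted sleep cannot complete a later LinkedBlockingQueue.put, while the same put succeeds once the flag has been cleared. Clearing the flag here is for illustration only and is not presented as the fix adopted for this issue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InterruptFlagDemo {

  /** Stand-in for the retry sleep: it is interrupted, then restores the flag. */
  static void sleepAndRestoreFlag() {
    // Simulates future.cancel(true) interrupting the worker thread.
    Thread.currentThread().interrupt();
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
      // Same pattern as the retry handler: set the interrupt flag again.
      Thread.currentThread().interrupt();
    }
  }

  /** Returns true if the event was queued, false if put() was interrupted. */
  static boolean tryPut(BlockingQueue<String> queue, boolean clearFlagFirst) {
    if (clearFlagFirst) {
      // Thread.interrupted() reads and clears the current thread's flag.
      Thread.interrupted();
    }
    try {
      queue.put("APP_REJECTED"); // hypothetical event name
      return true;
    } catch (InterruptedException e) {
      return false;
    }
  }

  /** Runs one scenario on its own thread so the caller's flag is untouched. */
  static boolean runScenario(boolean clearFlagFirst) {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    boolean[] queued = new boolean[1];
    Thread worker = new Thread(() -> {
      sleepAndRestoreFlag();
      queued[0] = tryPut(queue, clearFlagFirst);
    });
    worker.start();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for worker", e);
    }
    return queued[0];
  }

  public static void main(String[] args) {
    System.out.println("put with flag still set: " + runScenario(false));
    System.out.println("put after clearing flag: " + runScenario(true));
  }
}
```

Running main prints false for the first scenario and true for the second, which matches the "AsyncDispatcher thread interrupted" warning in the log: the put is rejected because of the restored flag, not because the queue is full.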