[ 
https://issues.apache.org/jira/browse/YARN-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weihao Zheng updated YARN-11856:
--------------------------------
    Description: 
Currently, LocalizerRunner tries to send a ContainerResourceFailedEvent to 
dispatcher before unlocking and cleaning up downloading resource when failed by 
interruption. This handle invoking will throw uncaught exception, which causes 
other containers depend on the same private resource stuck at localizing.

 

Related logs:
{quote}2025-MM-DD HH:04:59,573 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Stopping container with container Id: 
container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX
... ...

2025-MM-DD HH:04:59,573 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX transitioned from 
LOCALIZING to KILLING
... ...

2025-MM-DD HH:04:59,627 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Unknown localizer with localizerId 
container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX is sending heartbeat. Ordering it 
to DIE
2025-MM-DD HH:04:59,628 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Localizer failed for container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX
java.io.IOException: java.io.InterruptedIOException: Interrupted waiting to 
send RPC request to server
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:200)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:180)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1247)
Caused by: java.io.InterruptedIOException: Interrupted waiting to send RPC 
request to server
at org.apache.hadoop.ipc.Client.call(Client.java:1446)
at org.apache.hadoop.ipc.Client.call(Client.java:1388)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy82.heartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:63)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:306)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:198)
... 2 more
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1158)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
... 9 more
2025-MM-DD HH:04:59,629 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: 
AsyncDispatcher thread interrupted
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1271)
2025-MM-DD HH:04:59,629 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for 
container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:312)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1271)
Caused by: java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
... 1 more
{quote}

  was:
Currently, LocalizerRunner tries to send a ContainerResourceFailedEvent to 
dispatcher before unlocking and cleaning up downloading resource when failed by 
interruption. This handle invoking will throw uncaught exception.

 

Related logs:
{quote}2025-MM-DD HH:04:59,573 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Stopping container with container Id: 
container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX
... ...

2025-MM-DD HH:04:59,573 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX transitioned from 
LOCALIZING to KILLING
... ...

2025-MM-DD HH:04:59,627 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Unknown localizer with localizerId 
container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX is sending heartbeat. Ordering it 
to DIE
2025-MM-DD HH:04:59,628 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Localizer failed for container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX
java.io.IOException: java.io.InterruptedIOException: Interrupted waiting to 
send RPC request to server
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:200)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:180)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1247)
Caused by: java.io.InterruptedIOException: Interrupted waiting to send RPC 
request to server
at org.apache.hadoop.ipc.Client.call(Client.java:1446)
at org.apache.hadoop.ipc.Client.call(Client.java:1388)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy82.heartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:63)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:306)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:198)
... 2 more
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1158)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
... 9 more
2025-MM-DD HH:04:59,629 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: 
AsyncDispatcher thread interrupted
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1271)
2025-MM-DD HH:04:59,629 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for 
container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:312)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1271)
Caused by: java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
... 1 more{quote}


> DOWNLOADING resources unlock and cleanup is interrupted when killing a 
> container that is localizing
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11856
>                 URL: https://issues.apache.org/jira/browse/YARN-11856
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Weihao Zheng
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, LocalizerRunner tries to send a ContainerResourceFailedEvent to 
> dispatcher before unlocking and cleaning up downloading resource when failed 
> by interruption. This handle invoking will throw uncaught exception, which 
> causes other containers depend on the same private resource stuck at 
> localizing.
>  
> Related logs:
> {quote}2025-MM-DD HH:04:59,573 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: 
> container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX
> ... ...
> 2025-MM-DD HH:04:59,573 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX transitioned from 
> LOCALIZING to KILLING
> ... ...
> 2025-MM-DD HH:04:59,627 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Unknown localizer with localizerId 
> container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX is sending heartbeat. Ordering 
> it to DIE
> 2025-MM-DD HH:04:59,628 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed for container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX
> java.io.IOException: java.io.InterruptedIOException: Interrupted waiting to 
> send RPC request to server
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:200)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1247)
> Caused by: java.io.InterruptedIOException: Interrupted waiting to send RPC 
> request to server
> at org.apache.hadoop.ipc.Client.call(Client.java:1446)
> at org.apache.hadoop.ipc.Client.call(Client.java:1388)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy82.heartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:63)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:306)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:198)
> ... 2 more
> Caused by: java.lang.InterruptedException
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
> at java.util.concurrent.FutureTask.get(FutureTask.java:191)
> at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1158)
> at org.apache.hadoop.ipc.Client.call(Client.java:1441)
> ... 9 more
> 2025-MM-DD HH:04:59,629 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: 
> AsyncDispatcher thread interrupted
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1271)
> 2025-MM-DD HH:04:59,629 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e02_XXXXXXXXXXXXX_XXXXXXX_XX_XXXXXX,5,main] threw an Exception.
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:312)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1271)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
> ... 1 more
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to