[
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973352#comment-16973352
]
Tarun Parimi commented on YARN-9968:
------------------------------------
[~snemeth], I was finally able to reproduce it artificially in my test cluster. I
added the sleep and subsequent exception below to the FSDownload class to
simulate HDFS not responding for a minute and then throwing an exception
during the download. When the application that requested the resource
is killed during the minute the thread sleeps, I get the NullPointerException
and the public localizer exits.
{code:java}
try {
  // Simulate HDFS hanging for a minute, then failing the download.
  Thread.sleep(60000);
  throw new ExecutionException("Test", new IOException("Exception"));
} catch (InterruptedException e) {
  throw new IOException(e);
}
{code}
From this I understood that the issue occurs when the following sequence of
events occurs:
1. The public localizer has been waiting on the download of a file from HDFS
for quite some time.
2. The application is killed or fails while the download is still
waiting/sleeping. This triggers the app cleanup, which removes the
LocalResourcesTracker for that app.
{code:java}
private void handleDestroyApplicationResources(Application application) {
  String userName = application.getUser();
  ApplicationId appId = application.getAppId();
  String appIDStr = application.toString();
  LocalResourcesTracker appLocalRsrcsTracker =
      appRsrc.remove(appId.toString());
{code}
3. The download finally fails and throws an exception from HDFS.
4. Since the tracker was removed by the app kill, we get the
NullPointerException in the code below because tracker is null. This causes the
public localizer to exit and stop handling future localization requests.
{code:java}
tracker.handle(new ResourceFailedLocalizationEvent(
    assoc.getResource().getRequest(), diagnostics));
{code}
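The four steps above can be replayed outside the NodeManager with a few stand-in types. This is only a sketch under assumptions: LocalizerRaceSketch, Tracker and replay are hypothetical names, the map merely plays the role of appRsrc, and the idea that the localizer loop shuts its download pool down after any Throwable is inferred from the "Terminated" pool state in the issue description, not copied from the NodeManager source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-ins only; none of these are the real NodeManager classes.
class LocalizerRaceSketch {
    /** Stand-in for LocalResourcesTracker. */
    interface Tracker {
        void handle(String event);
    }

    /**
     * Replays steps 1-4 in order on one thread and reports whether a later
     * download request is "accepted" or "rejected" by the download pool.
     */
    static String replay(String appId) {
        Map<String, Tracker> appRsrc = new ConcurrentHashMap<>();
        ExecutorService threadPool = Executors.newFixedThreadPool(1);

        appRsrc.put(appId, event -> { });      // step 1: app registered, download pending
        appRsrc.remove(appId);                 // step 2: app killed, tracker removed
        Tracker tracker = appRsrc.get(appId);  // steps 3-4: download failed, lookup is null

        try {
            tracker.handle("ResourceFailedLocalizationEvent");  // NullPointerException
        } catch (Throwable t) {
            // Assumption: the localizer loop treats any Throwable as fatal
            // and shuts its download pool down on the way out.
            threadPool.shutdownNow();
        }

        try {
            threadPool.submit(() -> { });      // a later localization request
            return "accepted";
        } catch (RejectedExecutionException e) {
            return "rejected";
        }
    }

    public static void main(String[] args) {
        System.out.println("next download request: " + replay("application_0001"));
    }
}
```

In this sketch the later request is rejected, which matches the RejectedExecutionException seen for new containers once the public localizer has exited.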
This issue was introduced by the changes in YARN-8403, where the failed
localization is notified to the app for logging in the AM.
I think adding a null check to prevent this should be safe, as the AM is
already killed in this scenario. I will provide an initial patch based on this.
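As a rough illustration of the null check idea (not the actual patch), the guard could look like the sketch below. NullSafeFailureSketch, Tracker and reportFailure are hypothetical names, and plain strings stand in for the real event and diagnostics types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins only; the real patch may look different.
class NullSafeFailureSketch {
    /** Stand-in for LocalResourcesTracker. */
    interface Tracker {
        void handle(String event);
    }

    /**
     * Reports a failed download to the owning app's tracker if it still exists.
     * Returns true if the event was delivered, false if it was dropped because
     * the app had already been cleaned up.
     */
    static boolean reportFailure(Map<String, Tracker> appRsrc, String appId,
                                 String diagnostics) {
        Tracker tracker = appRsrc.get(appId);
        if (tracker == null) {
            // App already killed and cleaned up: nobody is left to notify.
            return false;
        }
        tracker.handle("localization failed: " + diagnostics);
        return true;
    }

    public static void main(String[] args) {
        Map<String, Tracker> appRsrc = new ConcurrentHashMap<>();
        appRsrc.put("application_0001", event -> System.out.println("tracker got: " + event));
        System.out.println(reportFailure(appRsrc, "application_0001", "Exception"));
        appRsrc.remove("application_0001");  // simulate the app cleanup
        System.out.println(reportFailure(appRsrc, "application_0001", "Exception"));
    }
}
```

Dropping the event when the tracker is gone keeps the calling thread alive, so a failure belonging to a dead app cannot take down localization for every other app on the node.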
cc [~prabhujoseph]
> Public Localizer is exiting in NodeManager due to NullPointerException
> ----------------------------------------------------------------------
>
> Key: YARN-9968
> URL: https://issues.apache.org/jira/browse/YARN-9968
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.1.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService
> (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO localizer.ResourceLocalizationService
> (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager still keeps on running. Subsequent localization events for
> containers keep encountering the below error, resulting in failed
> Localization of all new containers.
> {code:java}
> ERROR localizer.ResourceLocalizationService
> (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc {
> { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null
> },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED}
> for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21
> rejected from
> org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
> 382286]
> at
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
> at
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.