[ 
https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923926#comment-13923926
 ] 

Jason Lowe commented on YARN-1800:
----------------------------------

This is the aftermath of an earlier error that shut down the public localizer 
thread.  We can harden the NM against these kinds of errors when the public 
localizer shuts down, but note that the NM will then be running in a damaged 
state where every public localization will fail the container.  This is better 
than the NM taking down everything, but we also need to get to the real root 
cause.  From the posted logs:

{noformat}
2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null }
2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting
{noformat}

That NullPointerException in PublicLocalizer.run is what really triggered this 
mess, and it's even more important to fix that.  I'll file a separate JIRA.
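
To illustrate the failure mode in the reported stack trace, here is a minimal 
standalone sketch. It is not the NM code and not the attached patch: the class 
RejectedSubmitSketch and the helper trySubmit are hypothetical names made up 
for this example. It shows that submitting through an 
ExecutorCompletionService whose pool has already been shut down throws 
RejectedExecutionException, and that catching it at the submit site lets the 
caller fail just that one request instead of letting the exception escape into 
the calling thread.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedSubmitSketch {

    /**
     * Hypothetical helper (not a YARN method): tries to queue a download
     * and reports whether the pool accepted it.
     */
    public static boolean trySubmit(CompletionService<String> queue,
                                    String resource) {
        try {
            // Stand-in for the real download work.
            queue.submit(() -> "localized " + resource);
            return true;
        } catch (RejectedExecutionException e) {
            // The pool is already shut down. Fail only this request rather
            // than letting the exception propagate to the caller's thread.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<String> queue =
            new ExecutorCompletionService<String>(pool);

        // While the pool is alive, the request is accepted and completes.
        boolean acceptedBefore = trySubmit(queue, "a.jar");
        String result = queue.take().get();

        // Simulate the earlier error path that shut the pool down.
        pool.shutdown();

        // After shutdown, the default AbortPolicy rejects new work.
        boolean acceptedAfter = trySubmit(queue, "b.jar");

        System.out.println("before shutdown: " + acceptedBefore
            + " (" + result + ")");
        System.out.println("after shutdown:  " + acceptedAfter);
    }
}
```

Handling the rejection this way only contains the damage. Every later request 
is still refused once the pool is gone, which is why the error that shut it 
down in the first place still has to be fixed.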

> YARN NodeManager with java.util.concurrent.RejectedExecutionException
> ---------------------------------------------------------------------
>
>                 Key: YARN-1800
>                 URL: https://issues.apache.org/jira/browse/YARN-1800
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>         Environment: HDP 2.0
>            Reporter: Paul Isaychuk
>            Assignee: Varun Vasudev
>            Priority: Critical
>         Attachments: apache-yarn-1800.0.patch, 
> yarn-yarn-nodemanager-fertrist-2-4.log.zip
>
>
> Noticed this on tests running on BWGA cluster
> {code}
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource (LocalizedResource.java:handle(196)) - Resource hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,576 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440627435, FILE, null }
> 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException
>         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
>         at java.lang.Thread.run(Thread.java:662)
> 2014-01-23 01:30:28,577 INFO  event.AsyncDispatcher (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye..
> 2014-01-23 01:30:28,596 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped [email protected]:50060
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) - Applications still running : [application_1389742077466_0396]
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
