[
https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923926#comment-13923926
]
Jason Lowe commented on YARN-1800:
----------------------------------
This is the aftermath of an earlier error that shut down the public localizer
thread. We can harden the NM against these kinds of errors when the public
localizer shuts down, but note that the NM will then be running in a damaged
state where every public localization fails its container. That is better than
the NM taking everything down, but we also need to get to the real root cause.
From the posted logs:
{noformat}
2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null }
2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting
{noformat}
That NullPointerException is what really triggered this mess, and fixing it is
even more important. I'll file a separate JIRA.
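
To illustrate the hardening described above, here is a minimal sketch, not the actual NodeManager code and not the attached patch. The class `HardenedPublicLocalizer` and its methods are hypothetical stand-ins. It assumes the public localizer submits downloads through an `ExecutorCompletionService`, as the stack trace in the issue description suggests, and shows how catching `RejectedExecutionException` at submission time can fail a single resource instead of letting the exception escape to the calling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for the public localizer; names are illustrative only.
public class HardenedPublicLocalizer {
    private final ExecutorService threadPool = Executors.newFixedThreadPool(2);
    private final ExecutorCompletionService<String> queue =
        new ExecutorCompletionService<>(threadPool);
    // Resources whose download could not be scheduled.
    private final List<String> failed = new ArrayList<>();

    /** Simulates the earlier fatal error: the localizer stops its thread pool. */
    public void shutDown() {
        threadPool.shutdownNow();
    }

    /**
     * Returns true if the download was scheduled. If the pool has already
     * been shut down, the submission is rejected; the failure is recorded
     * for this resource only and the exception does not reach the caller.
     */
    public boolean addResource(String resource) {
        try {
            queue.submit(() -> resource); // placeholder for the real download
            return true;
        } catch (RejectedExecutionException e) {
            failed.add(resource);
            return false;
        }
    }

    public List<String> failedResources() {
        return failed;
    }

    public static void main(String[] args) {
        HardenedPublicLocalizer localizer = new HardenedPublicLocalizer();
        System.out.println(localizer.addResource("a.jar")); // scheduled normally
        localizer.shutDown();
        System.out.println(localizer.addResource("b.jar")); // rejected, but caught
        System.out.println(localizer.failedResources());
    }
}
```

In this sketch the caller keeps running after the rejection, which matches the damaged-but-alive state described above: every later public localization fails, yet the process does not exit.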
> YARN NodeManager with java.util.concurrent.RejectedExecutionException
> ---------------------------------------------------------------------
>
> Key: YARN-1800
> URL: https://issues.apache.org/jira/browse/YARN-1800
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Environment: HDP 2.0
> Reporter: Paul Isaychuk
> Assignee: Varun Vasudev
> Priority: Critical
> Attachments: apache-yarn-1800.0.patch,
> yarn-yarn-nodemanager-fertrist-2-4.log.zip
>
>
> Noticed this on tests running on BWGA cluster
> {code}
> 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource
> (LocalizedResource.java:handle(196)) - Resource
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource
> (LocalizedResource.java:handle(196)) - Resource
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource
> (LocalizedResource.java:handle(196)) - Resource
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO localizer.LocalizedResource
> (LocalizedResource.java:handle(196)) - Resource
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,576 INFO localizer.ResourceLocalizationService
> (ResourceLocalizationService.java:addResource(651)) - Downloading public
> rsrc:{
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
> 1390440627435, FILE, null }
> 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException
>         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
>         at java.lang.Thread.run(Thread.java:662)
> 2014-01-23 01:30:28,577 INFO event.AsyncDispatcher
> (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye..
> 2014-01-23 01:30:28,596 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped
> [email protected]:50060
> 2014-01-23 01:30:28,597 INFO containermanager.ContainerManagerImpl
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) -
> Applications still running : [application_1389742077466_0396]
> 2014-01-23 01:30:28,597 INFO containermanager.ContainerManagerImpl
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa
> {code}