Jason Lowe commented on YARN-1801:

Strictly speaking, the patch does prevent the NPE.  However the public 
localizer is still effectively doomed if this condition occurs because it 
returns from the run() method.  That will shut down the localizer thread and 
public local resource requests will stop being processed.  In that sense we've 
traded an NPE with a traceback for a one-line log message.  I'm not sure this 
is an improvement, since at least the traceback is easier to notice in the NM 
log and we get a corresponding fatal log when someone goes hunting for what 
went wrong with the public localizer.
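
To make the concern concrete, here is a minimal sketch of the difference 
between returning from the loop and skipping a single entry.  It is not the 
ResourceLocalizationService code or the attached patch; the class 
LoopExitSketch, the drain() method and the String keys are all hypothetical 
stand-ins.  Skipping is shown only for contrast and is not necessarily a 
correct recovery.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LoopExitSketch {
    /**
     * Walks the completed keys and looks each one up in pending.
     * Returns how many entries were processed before the loop ended.
     */
    static int drain(List<String> completed, Map<String, String> pending,
                     boolean returnOnMissing) {
        int processed = 0;
        for (String key : completed) {
            String request = pending.remove(key);
            if (request == null) {
                System.err.println("No pending request for " + key);
                if (returnOnMissing) {
                    return processed;  // loop ends; later keys are never handled
                }
                continue;              // skip only the orphaned completion
            }
            processed++;
        }
        return processed;
    }

    /** Builds a fresh pending map with three known requests. */
    static Map<String, String> newPending() {
        Map<String, String> pending = new HashMap<>();
        pending.put("a", "request-a");
        pending.put("b", "request-b");
        pending.put("c", "request-c");
        return pending;
    }

    public static void main(String[] args) {
        List<String> completed = Arrays.asList("a", "orphan", "b", "c");
        System.out.println(drain(completed, newPending(), true));   // prints 1
        System.out.println(drain(completed, newPending(), false));  // prints 3
    }
}
```

With the early return, "b" and "c" are never processed, which mirrors the 
public localizer going quiet after a single one-line log message.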

The real issue is we need to understand what happened to cause 
pending.remove(completed) to return null.  This should never happen, and if it 
does then it means we have a bug.  Trying to recover from this condition is 
patching a symptom rather than a root cause.  The problem that led to the null 
request event _might_ have been fixed by YARN-1575 which wasn't present in 2.2 
where the original bug occurred.  It would be interesting to know if this has 
reoccurred since 2.3.0.

Assuming this is still a potential issue, we should either find a way to 
prevent it from ever occurring or recover in a way that keeps the public 
localizer working as much as possible. It'd be great if we could just pull from 
the queue and receive a structure that has both the request event and the 
Future<Path> so we don't have to worry about a Future<Path> with no associated 
event.  If we're going to try to recover instead, we'd have to log an error and 
try to clean up.  With no associated request event and no path if we got an 
execution error, it's going to be particularly difficult to recover properly.
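
One way the "structure that has both" idea could look is sketched below, 
assuming a plain java.util.concurrent completion service.  The Request and 
Outcome classes and the paired() helper are hypothetical names for 
illustration, not existing YARN types.  The task itself returns an object 
carrying the request plus either the path or the failure, so whatever comes 
off the queue always has its request attached and no separate pending map 
lookup is needed.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PairedCompletionSketch {
    /** Hypothetical stand-in for the localizer's request event. */
    static final class Request {
        final String resource;
        Request(String resource) { this.resource = resource; }
    }

    /** Carries the request together with either a path or the failure. */
    static final class Outcome {
        final Request request;
        final Path path;        // null on failure
        final Exception error;  // null on success
        Outcome(Request request, Path path, Exception error) {
            this.request = request;
            this.path = path;
            this.error = error;
        }
    }

    /** Wraps a download so its result always carries the originating request. */
    static Callable<Outcome> paired(Request req, Callable<Path> download) {
        return () -> {
            try {
                return new Outcome(req, download.call(), null);
            } catch (Exception e) {
                return new Outcome(req, null, e);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<Outcome> queue = new ExecutorCompletionService<>(pool);
        queue.submit(paired(new Request("a.jar"), () -> Paths.get("/tmp/a.jar")));
        queue.submit(paired(new Request("b.jar"),
            () -> { throw new IOException("download failed"); }));
        for (int i = 0; i < 2; i++) {
            Outcome o = queue.take().get();  // the request is never missing here
            if (o.error == null) {
                System.out.println(o.request.resource + " -> " + o.path);
            } else {
                System.out.println(o.request.resource + " failed: "
                    + o.error.getMessage());
            }
        }
        pool.shutdown();
    }
}
```

The trade-off in this sketch is that execution errors are captured inside the 
task rather than surfacing through Future.get(), so the consumer has to check 
the error field itself.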

> NPE in public localizer
> -----------------------
>                 Key: YARN-1801
>                 URL: https://issues.apache.org/jira/browse/YARN-1801
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Jason Lowe
>            Assignee: Hong Zhiguo
>            Priority: Critical
>         Attachments: YARN-1801.patch
> While investigating YARN-1800, I found this in the NM logs that caused the 
> public localizer to shut down:
> {noformat}
> 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(651)) - Downloading public 
> rsrc:{ 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/0000601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
>  1390440382009, FILE, null }
> 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(726)) - Error: Shutting down
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
> 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(728)) - Public cache exiting
> {noformat}

