[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972420#comment-16972420 ]

Tarun Parimi commented on YARN-9968:
------------------------------------

Hi [~snemeth]. Thanks for looking into this.
I haven't been able to reproduce the issue so far. It is happening on a heavily 
loaded prod cluster. The cluster is also configured to use 
DefaultContainerExecutor, so localization is done entirely inside the NM JVM 
process.

The NullPointerException occurs in the code below, where tracker.handle() is 
called. It looks like tracker is becoming null for some reason. Adding a null 
check on tracker might be a simple workaround, but understanding how the issue 
occurred might point us to a better fix.
{code:java}
final String diagnostics = "Failed to download resource " +
    assoc.getResource() + " " + e.getCause();
tracker.handle(new ResourceFailedLocalizationEvent(
    assoc.getResource().getRequest(), diagnostics));
{code}
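To illustrate the workaround, a null check could look like the sketch below. This is illustrative only, not a patch against the actual YARN classes: {{EventHandler}}, {{TrackerNullCheck}}, {{notifyFailure}} and the {{trackers}} map are hypothetical stand-ins for the per-user tracker lookup inside ResourceLocalizationService.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the tracker's event-handling interface;
// the real type lives in the YARN localizer code.
interface EventHandler {
    void handle(String event);
}

public class TrackerNullCheck {
    // Simplified registry standing in for the per-user tracker lookup.
    static final Map<String, EventHandler> trackers = new ConcurrentHashMap<>();

    // The proposed workaround: check tracker for null before calling handle(),
    // so a missing tracker degrades to a logged warning instead of an NPE
    // that kills the localizer thread.
    public static boolean notifyFailure(String user, String diagnostics) {
        EventHandler tracker = trackers.get(user);
        if (tracker == null) {
            System.err.println("No tracker found for " + user
                + ", dropping event: " + diagnostics);
            return false; // event dropped, but the thread survives
        }
        tracker.handle(diagnostics);
        return true;
    }

    public static void main(String[] args) {
        // Missing tracker: no NPE, the failure is just logged.
        boolean delivered = notifyFailure("job-user", "Failed to download resource");
        System.out.println("delivered=" + delivered);
    }
}
```

As noted above this only papers over the symptom; it doesn't explain why tracker is null in the first place.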

There are also multiple HDFS warnings during localization in the log just 
before this NullPointerException. So I think those HDFS issues during 
localization are related and are likely what triggers the problem in the first 
place, though I haven't completely figured out how.

{code:java}
WARN  impl.BlockReaderFactory (BlockReaderFactory.java:getRemoteBlockReaderFromTcp(764)) - I/O error constructing remote block reader.
java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-290360126-127.0.0.1-1559634768162:blk_3454574939_2740457478 received exception java.io.IOException: No data exists for block BP-290360126-127.0.0.1-1559634768162:blk_blk_3454574939_2740457478, for OP_READ_BLOCK, self=/127.0.0.1:15810, remote=/127.0.0.1:50010, for file /tmp/hadoop-yarn/staging/job-user/.staging/job_1571858983080_36874/job.jar, for pool BP-290360126-127.0.0.1-1559634768162 block 3814574939_2740867478
        at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:641)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:572)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:754)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
        at org.apache.commons.io.input.TeeInputStream.read(TeeInputStream.java:129)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
        at java.util.zip.ZipInputStream.readFully(ZipInputStream.java:403)
        at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:278)
        at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:122)
        at java.util.jar.JarInputStream.<init>(JarInputStream.java:83)
        at java.util.jar.JarInputStream.<init>(JarInputStream.java:62)
        at org.apache.hadoop.util.RunJar.unJar(RunJar.java:114)
        at org.apache.hadoop.util.RunJar.unJarAndSave(RunJar.java:167)
        at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:354)
        at org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:303)
        at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:283)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
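For what it's worth, the "Either queue is full or threadpool is shutdown" follow-on error in the issue description is exactly the standard JDK executor behavior: once the localizer's pool has terminated, every subsequent submit is rejected. A minimal standalone reproduction of just that executor behavior (plain JDK, no YARN code; the class name is mine):

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdown {
    // Returns true if a submit after shutdown is rejected, mirroring the
    // path PublicLocalizer.addResource() hits once the pool has terminated.
    public static boolean submitAfterShutdownIsRejected() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ExecutorCompletionService<String> ecs =
            new ExecutorCompletionService<>(pool);
        pool.shutdownNow(); // analogous to the pool after "Public cache exiting"
        try {
            ecs.submit(() -> "resource");
            return false; // unreachable: the default AbortPolicy rejects
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected=" + submitAfterShutdownIsRejected());
    }
}
```

So the NPE killing the PublicLocalizer thread is the root event; the rejections are just the dead pool's normal behavior afterwards.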




> Public Localizer is exiting in NodeManager due to NullPointerException
> ----------------------------------------------------------------------
>
>                 Key: YARN-9968
>                 URL: https://issues.apache.org/jira/browse/YARN-9968
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager itself keeps running, but subsequent localization events for 
> containers keep encountering the error below, so localization of all new 
> containers fails.
> {code:java}
> ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 rejected from org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 382286]
>         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>         at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable again only after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
