[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980094#comment-16980094 ]

Hudson commented on YARN-9968:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17688 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17688/])
YARN-9968. Public Localizer is exiting in NodeManager due to (snemeth: rev 4c1a1287bc58390900ba1c79818d3ba491c4862c)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java

> Public Localizer is exiting in NodeManager due to NullPointerException
> ----------------------------------------------------------------------
>
>                 Key: YARN-9968
>                 URL: https://issues.apache.org/jira/browse/YARN-9968
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-9968.001.patch
>
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager still keeps on running. Subsequent localization events for containers keep encountering the below error, resulting in failed Localization of all new containers.
> {code:java}
> ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 rejected from org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 382286]
>         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>         at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
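The second error in the issue description can be reproduced in isolation. The sketch below assumes that the pool behind the public localizer has been shut down by the time new resources are submitted, which is what the `Terminated` state in the log suggests; the class and method names are hypothetical and a plain JDK fixed thread pool stands in for HadoopThreadPoolExecutor.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical, self-contained sketch; not NodeManager code.
class RejectedAfterShutdownSketch {

  /** Returns true if a task submitted after shutdown is rejected. */
  static boolean submitAfterShutdownIsRejected() {
    // Stand-in for the localizer's download pool (assumption: default AbortPolicy).
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ExecutorCompletionService<String> queue = new ExecutorCompletionService<>(pool);

    // Models the localizer thread exiting and taking its pool down with it.
    pool.shutdownNow();

    try {
      // Models a later addResource() call for a new container.
      queue.submit(() -> "conf.xml");
      return false;
    } catch (RejectedExecutionException e) {
      // Same exception type as in the "Failed to submit rsrc" log entry.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("rejected after shutdown: " + submitAfterShutdownIsRejected());
  }
}
```

Under these assumptions every later submission fails the same way until a new pool is created, which matches the report that only a NodeManager restart restores localization.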
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980082#comment-16980082 ]

Szilard Nemeth commented on YARN-9968:
--------------------------------------

Hi [~tarunparimi]!
Your [explanation|https://issues.apache.org/jira/browse/YARN-9968?focusedCommentId=16973352=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16973352] makes sense. Thank you for spending time on this and for adding a good explanation.
I can't think of anything more than adding this null-check, as the exception is logged anyway, so I committed it to trunk!
Thanks for your contribution.
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979324#comment-16979324 ]

Tarun Parimi commented on YARN-9968:
------------------------------------

[~snemeth], please review this when you get time.
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973475#comment-16973475 ]

Hadoop QA commented on YARN-9968:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 53s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 34s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 28s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 72m 47s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9968 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985748/YARN-9968.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ea72ba70c0cf 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / df6b316 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25156/testReport/ |
| Max. process+thread count | 413 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25156/console |
| Powered by | Apache Yetus 0.8.0
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973376#comment-16973376 ]

Szilard Nemeth commented on YARN-9968:
--------------------------------------

Thanks for this investigation [~tarunparimi]!
Waiting for the patch. Will help you with reviews and commit!
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973352#comment-16973352 ]

Tarun Parimi commented on YARN-9968:
------------------------------------

[~snemeth], I was finally able to reproduce it artificially in my test cluster. I added the sleep and subsequent exception below to the FSDownload class to simulate HDFS not responding for a minute and then throwing an exception during the download. When the application that requested the resource is killed during the minute in which the thread sleeps, I get the NullPointerException and the public localizer exits.
{code:java}
try {
  // Simulate HDFS not responding for one minute, then a failed download.
  Thread.sleep(60000);
  throw new ExecutionException("Test", new IOException("Exception"));
} catch (InterruptedException e) {
  throw new IOException(e);
}
{code}
From this I understood that the issue occurs with the following sequence of events:
 1. The public localizer is waiting on the download of a file from HDFS for quite some time.
 2. The application gets killed or fails while the download is still waiting/sleeping. This triggers the app cleanup, which removes the LocalResourcesTracker for that app.
{code:java}
private void handleDestroyApplicationResources(Application application) {
  String userName = application.getUser();
  ApplicationId appId = application.getAppId();
  String appIDStr = application.toString();
  LocalResourcesTracker appLocalRsrcsTracker =
      appRsrc.remove(appId.toString());
{code}
 3. The download finally fails, throwing an exception from HDFS.
 4. Since the tracker was removed due to the app kill, we get the NullPointerException in the code below because tracker is null. This causes the public localizer to exit and stop handling future localization requests.
{code:java}
tracker.handle(new ResourceFailedLocalizationEvent(
    assoc.getResource().getRequest(), diagnostics));
{code}
This issue was introduced by the changes in YARN-8403, where the failed localization is reported to the app for logging in the AM. I think adding a null check to prevent this should be safe, as the AM is already killed in this scenario. Will provide an initial patch based on this.

cc [~prabhujoseph]
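The sequence of events in the comment above, together with the proposed null check, can be modelled in a few lines. This is a minimal, hypothetical sketch and not the attached patch, whose contents are not shown in this thread: `Tracker`, `reportFailure`, and the plain map are stand-ins for LocalResourcesTracker, the code around tracker.handle(), and appRsrc.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, self-contained sketch; not the YARN-9968 patch.
class TrackerNullCheckSketch {

  /** Stand-in for LocalResourcesTracker (assumption for illustration). */
  interface Tracker {
    void handle(String event);
  }

  /**
   * Reports a failed download to the application's tracker.
   * Returns false when app cleanup has already removed the tracker,
   * instead of letting a NullPointerException end the calling thread.
   */
  static boolean reportFailure(Map<String, Tracker> trackers,
                               String appId, String diagnostics) {
    Tracker tracker = trackers.get(appId);
    if (tracker == null) {
      // The application is already gone, so there is nobody left to notify.
      return false;
    }
    tracker.handle(diagnostics);
    return true;
  }

  public static void main(String[] args) {
    Map<String, Tracker> trackers = new ConcurrentHashMap<>();
    trackers.put("app_1", event -> System.out.println("tracker got: " + event));

    // Download fails while the application is still alive.
    System.out.println(reportFailure(trackers, "app_1", "Failed to download resource"));

    // Application is killed (step 2), then the download fails (steps 3 and 4).
    trackers.remove("app_1");
    System.out.println(reportFailure(trackers, "app_1", "Failed to download resource"));
  }
}
```

With the guard in place, the late failure is simply dropped and the caller keeps running, which is the behaviour the comment argues is safe once the AM has been killed.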
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972420#comment-16972420 ]

Tarun Parimi commented on YARN-9968:
------------------------------------

Hi [~snemeth]. Thanks for looking into this. I have not been able to reproduce the issue so far. It is happening on a heavily loaded prod cluster. The cluster is also configured to use DefaultContainerExecutor, so all localization is done inside the NM JVM process.

The NullPointerException occurs in the code below, where tracker.handle() is called. It looks like tracker is becoming null for some reason. A null check on tracker might be a simple workaround, but understanding how the issue occurs might give us a better way to fix it.
{code:java}
final String diagnostics = "Failed to download resource " +
    assoc.getResource() + " " + e.getCause();
tracker.handle(new ResourceFailedLocalizationEvent(
    assoc.getResource().getRequest(), diagnostics));
{code}
There are also multiple HDFS warnings during localization in the log just before this NullPointerException. So I think those HDFS issues are definitely related and are causing the issue in the first place, but I haven't completely figured out how.
{code:java}
WARN impl.BlockReaderFactory (BlockReaderFactory.java:getRemoteBlockReaderFromTcp(764)) - I/O error constructing remote block reader.
java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-290360126-127.0.0.1-1559634768162:blk_3454574939_2740457478 received exception java.io.IOException: No data exists for block BP-290360126-127.0.0.1-1559634768162:blk_blk_3454574939_2740457478, for OP_READ_BLOCK, self=/127.0.0.1:15810, remote=/127.0.0.1:50010, for file /tmp/hadoop-yarn/staging/job-user/.staging/job_1571858983080_36874/job.jar, for pool BP-290360126-127.0.0.1-1559634768162 block 3814574939_2740867478
        at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:641)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:572)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:754)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
        at org.apache.commons.io.input.TeeInputStream.read(TeeInputStream.java:129)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
        at java.util.zip.ZipInputStream.readFully(ZipInputStream.java:403)
        at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:278)
        at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:122)
        at java.util.jar.JarInputStream.<init>(JarInputStream.java:83)
        at java.util.jar.JarInputStream.<init>(JarInputStream.java:62)
        at org.apache.hadoop.util.RunJar.unJar(RunJar.java:114)
        at org.apache.hadoop.util.RunJar.unJarAndSave(RunJar.java:167)
        at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:354)
        at org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:303)
        at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:283)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
        at
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972413#comment-16972413 ]

Szilard Nemeth commented on YARN-9968:
--------------------------------------

Hi [~tarunparimi]!
Could you please add reproduction steps?
Thanks!