[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980094#comment-16980094
 ] 

Hudson commented on YARN-9968:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17688 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17688/])
YARN-9968. Public Localizer is exiting in NodeManager due to (snemeth: rev 
4c1a1287bc58390900ba1c79818d3ba491c4862c)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java


> Public Localizer is exiting in NodeManager due to NullPointerException
> --
>
> Key: YARN-9968
> URL: https://issues.apache.org/jira/browse/YARN-9968
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.1.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9968.001.patch
>
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager keeps running, but subsequent localization events for
> containers keep hitting the error below, so localization fails for all
> new containers.
> {code:java}
> ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 rejected from org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 382286]
>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>     at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.
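
The second log above shows the executor in the Terminated state rejecting a new task. A minimal, self-contained sketch of that behaviour with plain java.util.concurrent classes follows. It does not use any Hadoop code: the class name RejectedAfterShutdownDemo and the "conf.xml" task are made up for illustration, and a standard fixed thread pool stands in for HadoopThreadPoolExecutor.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class, not part of Hadoop.
class RejectedAfterShutdownDemo {

  /** Returns true if a task submitted after shutdown is rejected. */
  static boolean submitAfterShutdownIsRejected() {
    // Stand-in for the localizer's download pool (assumption: default AbortPolicy).
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ExecutorCompletionService<String> queue =
        new ExecutorCompletionService<>(pool);

    // Simulates the pool being shut down when the consuming thread exits.
    pool.shutdownNow();

    try {
      // Analogous to a new localization request arriving afterwards.
      queue.submit(() -> "conf.xml");
      return false;
    } catch (RejectedExecutionException e) {
      // AbortPolicy rejects every task once the pool is shut down.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("rejected = " + submitAfterShutdownIsRejected());
  }
}
```

Once the pool is shut down there is no way to reopen it, which is consistent with the report that only a NodeManager restart restores localization.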



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980082#comment-16980082
 ] 

Szilard Nemeth commented on YARN-9968:
--

Hi [~tarunparimi]!
Your 
[explanation|https://issues.apache.org/jira/browse/YARN-9968?focusedCommentId=16973352=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16973352]
 makes sense. Thank you for spending time on this and for adding a good 
explanation.
I can't think of anything more to do than adding this null check, as the 
exception is logged anyway, so I committed the patch to trunk!
Thanks for your contribution.



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979324#comment-16979324
 ] 

Tarun Parimi commented on YARN-9968:


[~snemeth], please review this when you get time.



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-13 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973475#comment-16973475
 ] 

Hadoop QA commented on YARN-9968:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 22s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 17m 2s | trunk passed |
| +1 | compile | 1m 2s | trunk passed |
| +1 | checkstyle | 0m 32s | trunk passed |
| +1 | mvnsite | 0m 39s | trunk passed |
| +1 | shadedclient | 12m 53s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 59s | trunk passed |
| +1 | javadoc | 0m 32s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 35s | the patch passed |
| +1 | compile | 0m 54s | the patch passed |
| +1 | javac | 0m 54s | the patch passed |
| +1 | checkstyle | 0m 26s | the patch passed |
| +1 | mvnsite | 0m 34s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 34s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 2s | the patch passed |
| +1 | javadoc | 0m 26s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 21m 28s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
| | | 72m 47s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9968 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985748/YARN-9968.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ea72ba70c0cf 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / df6b316 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25156/testReport/ |
| Max. process+thread count | 413 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25156/console |
| Powered by | Apache Yetus 0.8.0   

[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-13 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973376#comment-16973376
 ] 

Szilard Nemeth commented on YARN-9968:
--

Thanks for this investigation [~tarunparimi]!
Waiting for the patch. Will help you with reviews and commit!



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-13 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973352#comment-16973352
 ] 

Tarun Parimi commented on YARN-9968:


[~snemeth], I was finally able to reproduce it artificially in my test cluster. 
I added the sleep and subsequent exception below to the FSDownload class to 
simulate HDFS not responding for a minute and then throwing an exception 
while trying to download. When the application that requested the resource 
is killed during the minute in which the thread sleeps, I get the null pointer 
issue and the public localizer exits.

{code:java}
  try {
    // Simulate HDFS not responding for a minute, then failing the download.
    Thread.sleep(60000);
    throw new ExecutionException("Test", new IOException("Exception"));
  } catch (InterruptedException e) {
    throw new IOException(e);
  }
{code}

From this I understood that the issue occurs when the following sequence of 
events takes place:

1. The public localizer waits on the download of a file from HDFS for 
quite some time.
2. The application gets killed or fails while the download is still 
waiting/sleeping. This triggers the app cleanup, which removes the 
LocalResourcesTracker for that app.

{code:java}
  private void handleDestroyApplicationResources(Application application) {
    String userName = application.getUser();
    ApplicationId appId = application.getAppId();
    String appIDStr = application.toString();
    // The per-application tracker is removed here during app cleanup.
    LocalResourcesTracker appLocalRsrcsTracker =
        appRsrc.remove(appId.toString());
{code}

3. The download finally fails with an exception from HDFS.
4. Since the tracker was removed when the app was killed, tracker is null in 
the code below and we get the NullPointerException. This causes the public 
localizer to exit and stop handling future localization requests.
{code:java}
  tracker.handle(new ResourceFailedLocalizationEvent(
  assoc.getResource().getRequest(), diagnostics));
{code}
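
The four steps above can be condensed into a small stand-alone sketch. This is only an illustration under simplifying assumptions, not the NodeManager code: TrackerRaceSketch, the Tracker interface and the string-based events are hypothetical stand-ins for ResourceLocalizationService, LocalResourcesTracker and ResourceFailedLocalizationEvent, and the slow download is collapsed into a direct call made after the cleanup.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch class, not part of Hadoop.
class TrackerRaceSketch {

  /** Hypothetical stand-in for LocalResourcesTracker. */
  interface Tracker {
    void handle(String event);
  }

  /** Per-application trackers, keyed by application id. */
  static final ConcurrentMap<String, Tracker> appRsrc =
      new ConcurrentHashMap<>();

  /** Step 2: application cleanup removes the tracker. */
  static void destroyApplicationResources(String appId) {
    appRsrc.remove(appId);
  }

  /** Steps 3 and 4: the failed download is reported without a null check. */
  static void reportFailureUnguarded(String appId, String diagnostics) {
    Tracker tracker = appRsrc.get(appId);
    tracker.handle(diagnostics); // NullPointerException if the app is gone
  }

  /** Runs the sequence and reports whether the NPE was hit. */
  static boolean reproducesNpe() {
    String appId = "application_0001"; // made-up id
    appRsrc.put(appId, event -> { });  // step 1: download pending for this app
    destroyApplicationResources(appId);
    try {
      reportFailureUnguarded(appId, "Failed to download resource");
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("NPE reproduced = " + reproducesNpe());
  }
}
```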

This issue was introduced by the changes in YARN-8403, where the failed 
localization is notified to the app for logging in the AM.

I think adding a null check to prevent this should be safe, as the AM is 
already killed in this scenario. Will provide an initial patch based on this.
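
A rough sketch of what such a null check could look like is given below. It is not the attached patch: NullCheckedFailureHandler, its Tracker interface and the notifyFailure helper are hypothetical names chosen for illustration, and a plain Map with string events replaces the real tracker lookup and ResourceFailedLocalizationEvent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class, not part of Hadoop and not the actual patch.
class NullCheckedFailureHandler {

  /** Hypothetical stand-in for LocalResourcesTracker. */
  interface Tracker {
    void handle(String event);
  }

  /**
   * Notifies the tracker of a failed download if the application still
   * exists. Returns false when the tracker was already removed.
   */
  static boolean notifyFailure(Map<String, Tracker> trackers,
                               String appId, String diagnostics) {
    Tracker tracker = trackers.get(appId);
    if (tracker == null) {
      // The app was cleaned up while the download was in flight.
      // There is nobody left to notify, so only log and continue.
      System.err.println("No tracker for " + appId + ", skipping: "
          + diagnostics);
      return false;
    }
    tracker.handle(diagnostics);
    return true;
  }

  public static void main(String[] args) {
    List<String> delivered = new ArrayList<>();
    Map<String, Tracker> trackers = new HashMap<>();
    trackers.put("app_live", delivered::add);

    // The live application is notified; the removed one is skipped.
    notifyFailure(trackers, "app_live", "download failed");
    notifyFailure(trackers, "app_gone", "download failed");
    System.out.println("delivered events = " + delivered.size());
  }
}
```

With a guard like this the consuming loop keeps running after a late failure for a removed application, instead of exiting on the exception.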

cc [~prabhujoseph]



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-12 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972420#comment-16972420
 ] 

Tarun Parimi commented on YARN-9968:


Hi [~snemeth]. Thanks for looking into this.
I have not been able to reproduce the issue so far. It is happening on a 
heavily loaded prod cluster. The cluster is also configured to use 
DefaultContainerExecutor, so all localization is done inside the NM JVM 
process.

The null pointer occurs in the code below, where tracker.handle() is called. 
It looks like tracker is becoming null for some reason. A null check on 
tracker might be a simple workaround, but understanding how the issue occurred 
might give us a better way to fix this.
{code:java}
  final String diagnostics = "Failed to download resource " +
      assoc.getResource() + " " + e.getCause();
  tracker.handle(new ResourceFailedLocalizationEvent(
      assoc.getResource().getRequest(), diagnostics));
{code}

There are also multiple HDFS warnings while doing localization in the log just 
before this NullPointerException. So I think those HDFS issues while localizing 
are definitely related and are causing the issue in the first place. But I 
haven't completely figured out how.

{code:java}
WARN  impl.BlockReaderFactory 
(BlockReaderFactory.java:getRemoteBlockReaderFromTcp(764)) - I/O error 
constructing remote block reader.
java.io.IOException: Got error, status=ERROR, status message opReadBlock 
BP-290360126-127.0.0.1-1559634768162:blk_3454574939_2740457478 received 
exception java.io.IOException: No data exists for block 
BP-290360126-127.0.0.1-1559634768162:blk_blk_3454574939_2740457478, for 
OP_READ_BLOCK, self=/127.0.0.1:15810, remote=/127.0.0.1:50010, for file 
/tmp/hadoop-yarn/staging/job-user/.staging/job_1571858983080_36874/job.jar, for 
pool BP-290360126-127.0.0.1-1559634768162 block 3814574939_2740867478
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:641)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:572)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:754)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
at 
org.apache.commons.io.input.TeeInputStream.read(TeeInputStream.java:129)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
at java.util.zip.ZipInputStream.readFully(ZipInputStream.java:403)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:278)
at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:122)
at java.util.jar.JarInputStream.<init>(JarInputStream.java:83)
at java.util.jar.JarInputStream.<init>(JarInputStream.java:62)
at org.apache.hadoop.util.RunJar.unJar(RunJar.java:114)
at org.apache.hadoop.util.RunJar.unJarAndSave(RunJar.java:167)
at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:354)
at 
org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:303)
at 
org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:283)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at 

[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-12 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972413#comment-16972413
 ] 

Szilard Nemeth commented on YARN-9968:
--

Hi [~tarunparimi]!
Could you please add reproduction steps? Thanks!
