[jira] [Commented] (TEZ-3932) TaskSchedulerManager can throw NullPointerException during DAGAppMaster container cleanup race

2018-05-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468960#comment-16468960
 ] 

Jason Lowe commented on TEZ-3932:
-

Thanks for the patch!  +1 lgtm.  Committing this.

> TaskSchedulerManager can throw NullPointerException during DAGAppMaster 
> container cleanup race
> --
>
> Key: TEZ-3932
> URL: https://issues.apache.org/jira/browse/TEZ-3932
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: arch: x86 and ppc
> java: openjdk version "1.8.0_161"
>  OpenJDK Runtime Environment (build 1.8.0_161-b14)
>  OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
> Reporter: Valencia Edna Serrao
> Assignee: Jonathan Eagles
> Priority: Major
> Labels: ppc, x86
> Attachments: TEZ-3932.001.patch, TEZ-3932.fail.patch, 
> org.apache.tez.test.TestExceptionPropagation-output.txt
>
>
> Test 
> org.apache.tez.test.TestExceptionPropagation.testExceptionPropagationSession 
> fails on x86 and ppc. I found the related JIRAs TEZ-3746 and TEZ-3748. Though 
> the issue is marked as resolved in those JIRAs, it still exists. Below are 
> the error details:
> {code:java}
> ---
> Test set: org.apache.tez.test.TestExceptionPropagation
> ---
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 96.433 sec 
> <<< FAILURE!
> testExceptionPropagationSession(org.apache.tez.test.TestExceptionPropagation) 
>  Time elapsed: 52.7 sec  <<< ERROR!
> org.apache.tez.dag.api.SessionNotRunning: Application not running, 
> applicationId=application_1525667420557_0001, yarnApplicationState=FAILED, 
> finalApplicationStatus=FAILED, trackingUrl=N/A, diagnostics=[DAG completed 
> with an ERROR state. Shutting down AM, Session stats:submittedDAGs=11, 
> successfulDAGs=0, failedDAGs=12, killedDAGs=0]
>     at 
> org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:910)
>     at org.apache.tez.client.TezClient.getAMProxy(TezClient.java:1024)
>     at org.apache.tez.client.TezClient.waitForProxy(TezClient.java:1034)
>     at 
> org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:652)
>     at org.apache.tez.client.TezClient.submitDAG(TezClient.java:588)
>     at 
> org.apache.tez.test.TestExceptionPropagation.testExceptionPropagationSession(TestExceptionPropagation.java:227
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3932) TaskSchedulerManager can throw NullPointerException during DAGAppMaster container cleanup race

2018-05-09 Thread Valencia Edna Serrao (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468775#comment-16468775
 ] 

Valencia Edna Serrao commented on TEZ-3932:
---

Great to see the initial patch, [~jeagles]. Looking forward to seeing the fix 
upstreamed.


[jira] [Commented] (TEZ-3932) TaskSchedulerManager can throw NullPointerException during DAGAppMaster container cleanup race

2018-05-08 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468042#comment-16468042
 ] 

Jonathan Eagles commented on TEZ-3932:
--

[~jlowe], can you take a look at this null-pointer protection patch? The test 
failure is due to an unrelated timeout (which should probably be bumped higher).


[jira] [Commented] (TEZ-3932) TaskSchedulerManager can throw NullPointerException during DAGAppMaster container cleanup race

2018-05-08 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468033#comment-16468033
 ] 

TezQA commented on TEZ-3932:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12922523/TEZ-3932.001.patch
  against master revision 081a64f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   
org.apache.tez.runtime.library.conf.TestUnorderedPartitionedKVEdgeConfig

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2794//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2794//console

This message is automatically generated.


[jira] [Commented] (TEZ-3932) TaskSchedulerManager can throw NullPointerException during DAGAppMaster container cleanup race

2018-05-08 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467951#comment-16467951
 ] 

Jonathan Eagles commented on TEZ-3932:
--

[~vserrao], thank you for providing the test logs; with them I was able to 
create a reliable test case that reproduces this issue. I have an initial patch 
that should remove the intermittent failure you have been facing, and I will 
work with the community to get it checked in. These logs show that this is not 
just a test issue but could happen in practice during shutdown scenarios.
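
As a rough illustration of the kind of shutdown race described here (this is 
not the contents of TEZ-3932.001.patch, and every class and method name below 
is hypothetical rather than a Tez API), a minimal sketch of one common way to 
guard against a NullPointerException when a cleanup thread clears a reference 
while an event is still being delivered might look like this:

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical sketch only: a manager whose delegate is cleared on shutdown.
 * None of these names come from Tez.
 */
public class NullSafeShutdownSketch {

    /** Hypothetical delegate that receives container-completion events. */
    public interface Scheduler {
        void containerCompleted(String containerId);
    }

    // Holds the delegate; stop() sets it to null, possibly from another thread.
    private final AtomicReference<Scheduler> scheduler = new AtomicReference<>();

    public void start(Scheduler s) {
        scheduler.set(s);
    }

    public void stop() {
        scheduler.set(null);
    }

    /**
     * Delivers the event if the delegate is still present.
     * Returns false when shutdown has already cleared it, instead of throwing.
     */
    public boolean onContainerCompleted(String containerId) {
        // Read the reference once into a local so that a concurrent stop()
        // cannot null it between the check and the call.
        Scheduler s = scheduler.get();
        if (s == null) {
            return false;
        }
        s.containerCompleted(containerId);
        return true;
    }

    public static void main(String[] args) {
        NullSafeShutdownSketch manager = new NullSafeShutdownSketch();
        manager.start(id -> System.out.println("completed " + id));
        System.out.println(manager.onContainerCompleted("container_1"));
        manager.stop();
        // A late event after shutdown is dropped rather than raising an NPE.
        System.out.println(manager.onContainerCompleted("container_2"));
    }
}
```

The sketch drops late events silently; whether dropping, logging, or queuing is 
appropriate depends on the component, and the actual patch may take a different 
approach.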
