[jira] [Updated] (MAPREDUCE-7048) Uber AM can crash due to unknown task in statusUpdate

2018-02-12 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-7048:
--
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.7.6
                   2.8.4
                   2.9.1
                   2.10.0
                   3.0.1
                   3.1.0
           Status: Resolved  (was: Patch Available)

Thanks, [~pbacsko]!  I committed this to trunk, branch-3.1, branch-3.0, 
branch-3.0.1, branch-2, branch-2.9, branch-2.8, and branch-2.7.

> Uber AM can crash due to unknown task in statusUpdate
> -
>
>                 Key: MAPREDUCE-7048
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7048
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
>
> Attachments: MAPREDUCE-7048-001.patch, MAPREDUCE-7048-002.patch, 
> MAPREDUCE-7048-003.patch, MAPREDUCE-7048-branch-2.01.patch, 
> MAPREDUCE-7048-branch-2.7.01.patch, MAPREDUCE-7048-branch-2.7.01.patch, 
> MAPREDUCE-7048-branch-2.8.01.patch, MAPREDUCE-7048-branch-2.9.01.patch
>
>
> The testcase TestUberAM#testThreadDumpOnTaskTimeout was supposed to be fixed 
> by MAPREDUCE-7020. However, it still fails, see: 
> https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7325/testReport/junit/org.apache.hadoop.mapreduce.v2/TestMRJobs/testThreadDumpOnTaskTimeout/
>  (note: other tests failed as well, but those look unrelated).
> When I tried to reproduce it locally, it failed again. The error message 
> looked slightly different at first, but it was actually the same as before:
> {noformat}
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running org.apache.hadoop.mapreduce.v2.TestUberAM
> [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 128.192 s <<< FAILURE! - in org.apache.hadoop.mapreduce.v2.TestUberAM
> [ERROR] testThreadDumpOnTaskTimeout(org.apache.hadoop.mapreduce.v2.TestUberAM)  Time elapsed: 79.539 s  <<< FAILURE!
> java.lang.AssertionError: No AppMaster log found! expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.apache.hadoop.mapreduce.v2.TestMRJobs.testThreadDumpOnTaskTimeout(TestMRJobs.java:1228)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
> *Root cause:* {{System.exit()}} is still invoked at {{Task.statusUpdate()}}
> {noformat}
>   public void statusUpdate(TaskUmbilicalProtocol umbilical)
>   throws IOException {
>     int retries = MAX_RETRIES;
>     while (true) {
>       try {
>         if (!umbilical.statusUpdate(getTaskID(), taskStatus).getTaskFound()) {
>           LOG.warn("Parent died.  Exiting "+taskId);
>           System.exit(66);
>         }
>         taskStatus.clearStatus();
>         return;
> ...
> {noformat}
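> At this point, the task was not found, so {{getTaskFound()}} on the response 
> returned by {{umbilical.statusUpdate()}} is false. In uber mode the task runs 
> inside the AM's JVM, so {{System.exit()}} brings down the AM itself. Checking 
> whether we run in uber mode before exiting seems to solve the problem.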
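
The uber-mode check described in the issue could look roughly like the sketch below. It is an illustration under stated assumptions, not the committed patch: the `Umbilical` and `ExitHandler` interfaces, the `uberized` flag, and the log text for the uber case are hypothetical stand-ins so the example compiles without Hadoop on the classpath. The exit is routed through a handler so that both paths can be exercised without terminating the JVM.

```java
import java.util.ArrayList;
import java.util.List;

class UberStatusUpdateSketch {
    /** Hypothetical stand-in for the umbilical: does the AM still know this task? */
    interface Umbilical {
        boolean taskFound(String taskId);
    }

    /** Hypothetical hook replacing a direct System.exit() call. */
    interface ExitHandler {
        void exit(int status);
    }

    /** Collected log lines, in place of a real logger. */
    static final List<String> LOG = new ArrayList<>();

    /**
     * Returns true if the status update was accepted. When the task is unknown,
     * an uberized task only logs and gives up (it shares the AM's JVM), while a
     * regular task asks the exit handler to terminate with status 66.
     */
    static boolean statusUpdate(Umbilical umbilical, String taskId,
                                boolean uberized, ExitHandler exitHandler) {
        if (umbilical.taskFound(taskId)) {
            return true;
        }
        if (uberized) {
            // Assumed wording; exiting here would kill the AM as well.
            LOG.add("Task no longer available: " + taskId);
            return false;
        }
        LOG.add("Parent died.  Exiting " + taskId);
        exitHandler.exit(66);
        return false;
    }

    public static void main(String[] args) {
        int[] exitCode = {-1};
        Umbilical unknownTask = id -> false;

        statusUpdate(unknownTask, "task_a", true, s -> exitCode[0] = s);
        System.out.println("uber mode exit code: " + exitCode[0]);

        statusUpdate(unknownTask, "task_a", false, s -> exitCode[0] = s);
        System.out.println("regular mode exit code: " + exitCode[0]);
    }
}
```

In a real task the handler for the regular case would end the process; in the sketch it only records the status so the difference between the two modes is observable.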



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-7048) Uber AM can crash due to unknown task in statusUpdate

2018-02-12 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-7048:
--
Summary: Uber AM can crash due to unknown task in statusUpdate  (was: AM 
can still crash after MAPREDUCE-7020)



