[jira] [Commented] (MAPREDUCE-6826) Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING

2018-05-07 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466353#comment-16466353
 ] 

Varun Saxena commented on MAPREDUCE-6826:
-

[~BilwaST], thanks for the patch.
Can you fix the errors due to tabs?
Also, can you add a test case for the COMMITTING -> JOB_TASK_COMPLETED 
transition as well, using WaitingOutputCommitter?
Also, name the test case method so that it reflects the transition being 
tested. Something like TestJobTaskCompletedWhileCommitting, for instance.
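
For illustration only, the transition discussed in this comment can be modelled with a small, self-contained sketch. It is not the attached patch and does not use Hadoop's JobImpl or WaitingOutputCommitter; ToyJob, JobState, JobEvent and ToyJobDemo are hypothetical names, and the behaviour (silently ignoring a late JOB_TASK_COMPLETED in COMMITTING or SUCCEEDED) is an assumption taken from the proposal in the issue description.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model; these are NOT Hadoop's JobImpl types.
enum JobState { RUNNING, COMMITTING, SUCCEEDED }

enum JobEvent { JOB_TASK_COMPLETED, JOB_COMMIT_COMPLETED }

final class ToyJob {
    // Assumption: a late JOB_TASK_COMPLETED is ignored in these states.
    private static final Set<JobState> IGNORE_TASK_COMPLETED =
        EnumSet.of(JobState.COMMITTING, JobState.SUCCEEDED);

    private final int totalTasks;
    private int completedTasks;
    private JobState state = JobState.RUNNING;

    ToyJob(int totalTasks) {
        this.totalTasks = totalTasks;
    }

    JobState getState() {
        return state;
    }

    void handle(JobEvent event) {
        switch (event) {
            case JOB_TASK_COMPLETED:
                if (IGNORE_TASK_COMPLETED.contains(state)) {
                    return; // job is already finishing: drop the event
                }
                completedTasks++;
                if (completedTasks == totalTasks) {
                    state = JobState.COMMITTING;
                }
                break;
            case JOB_COMMIT_COMPLETED:
                if (state != JobState.COMMITTING) {
                    throw new IllegalStateException(
                        "Invalid event: " + event + " at " + state);
                }
                state = JobState.SUCCEEDED;
                break;
        }
    }
}

final class ToyJobDemo {
    // Method name reflects the transition under test, as suggested above.
    static JobState jobTaskCompletedWhileCommitting() {
        ToyJob job = new ToyJob(1);
        job.handle(JobEvent.JOB_TASK_COMPLETED); // RUNNING -> COMMITTING
        job.handle(JobEvent.JOB_TASK_COMPLETED); // late event, ignored
        return job.getState();
    }

    public static void main(String[] args) {
        System.out.println(jobTaskCompletedWhileCommitting());
    }
}
```

In this toy model the job stays in COMMITTING after the second event instead of raising an invalid-transition error. A real test would drive the actual job state machine rather than this stand-in.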

> Job fails with InvalidStateTransitonException: Invalid event: 
> JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING
> 
>
> Key: MAPREDUCE-6826
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6826
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Varun Saxena
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-6826-001.patch, MAPREDUCE-6826-002.patch
>
>
> This happens if a container is preempted by the scheduler after the job 
> starts committing.
> This exception in turn leads to the application being marked as FAILED in 
> YARN.
> I think we can probably ignore the JOB_TASK_COMPLETED event while the JobImpl 
> state is COMMITTING or SUCCEEDED, as the job is in the process of finishing.
> Also, is there any point in attempting to schedule another task attempt if 
> the job is already in the COMMITTING or SUCCEEDED state?
> {noformat}
> 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: 
> task_1482404625971_23910_m_04 Task Transitioned from RUNNING to SUCCEEDED
> 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 5
> 2016-12-23 09:10:38,643 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1482404625971_23910Job Transitioned from RUNNING to COMMITTING
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
> the event EventType: CONTAINER_REMOTE_CLEANUP for container 
> container_e55_1482404625971_23910_01_10 taskAttempt 
> attempt_1482404625971_23910_m_04_1
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
> attempt_1482404625971_23910_m_04_1
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: 
> Opening proxy : linux-19:26009
> 2016-12-23 09:10:38,644 INFO [CommitterEvent Processor #4] 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing 
> the event EventType: JOB_COMMIT
> 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : 
> jvm_1482404625971_23910_m_60473139527690 asked for a task
> 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: 
> jvm_1482404625971_23910_m_60473139527690 is invalid and will be killed.
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for 
> JobFinishedEvent 
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1482404625971_23910Job Transitioned from COMMITTING to SUCCEEDED
> 2016-12-23 09:10:38,798 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Job finished cleanly, 
> recording last MRAppMaster retry
> 2016-12-23 09:10:38,798 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator 
> isAMLastRetry: true
> 2016-12-23 09:10:38,798 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: RMCommunicator notified 
> that shouldUnregistered is: true
> 2016-12-23 09:10:38,799 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: 
> true
> 2016-12-23 09:10:38,799 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: 
> JobHistoryEventHandler notified that forceJobCompletion is true
> 2016-12-23 09:10:38,799 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the 
> services
> 2016-12-23 09:10:38,800 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping 
> JobHistoryEventHandler. Size of the outstanding queue size is 1
> 2016-12-23 09:10:38,989 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before 
> Scheduling: PendingReds:0 Schedu

[jira] [Comment Edited] (MAPREDUCE-6826) Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING

2018-05-07 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466353#comment-16466353
 ] 

Varun Saxena edited comment on MAPREDUCE-6826 at 5/7/18 7:28 PM:
-

[~BilwaST], thanks for the patch.
Can you fix the errors due to tabs?
Also, can you add a test case for the COMMITTING -> JOB_TASK_COMPLETED 
transition as well, using WaitingOutputCommitter?
Additionally, name the test case method so that it reflects the 
transition being tested. Something like TestJobTaskCompletedWhileCommitting, 
for instance.


was (Author: varun_saxena):
[~BilwaST], thanks for the patch.
Can you fix the errors due to tabs?
Also can you add a test case for COMMITTING-> JOB_TASK_COMPLETED transition as 
well by using WaitingOutputCommitter.
Also name the test case method in a way that it reflects the transition being 
tested. Something like TestJobTaskCompletedWhileCommitting, for instance.


[jira] [Updated] (MAPREDUCE-6826) Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING

2018-05-07 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated MAPREDUCE-6826:

Summary: Job fails with InvalidStateTransitonException: Invalid event: 
JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING  (was: Job fails with 
InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED)

> Job fails with InvalidStateTransitonException: Invalid event: 
> JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING
> 
>
> Key: MAPREDUCE-6826
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6826
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Varun Saxena
> Assignee: Bilwa S T
> Priority: Major
> Attachments: MAPREDUCE-6826-001.patch, MAPREDUCE-6826-002.patch
>
>
> This happens if a container is preempted by the scheduler after the job 
> starts committing.
> This exception in turn leads to the application being marked as FAILED in 
> YARN.
> I think we can probably ignore the JOB_TASK_COMPLETED event while the JobImpl 
> state is COMMITTING or SUCCEEDED, as the job is in the process of finishing.
> Also, is there any point in attempting to schedule another task attempt if 
> the job is already in the COMMITTING or SUCCEEDED state?
> {noformat}
> 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: 
> task_1482404625971_23910_m_04 Task Transitioned from RUNNING to SUCCEEDED
> 2016-12-23 09:10:38,642 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 5
> 2016-12-23 09:10:38,643 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1482404625971_23910Job Transitioned from RUNNING to COMMITTING
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
> the event EventType: CONTAINER_REMOTE_CLEANUP for container 
> container_e55_1482404625971_23910_01_10 taskAttempt 
> attempt_1482404625971_23910_m_04_1
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
> attempt_1482404625971_23910_m_04_1
> 2016-12-23 09:10:38,644 INFO [ContainerLauncher #5] 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: 
> Opening proxy : linux-19:26009
> 2016-12-23 09:10:38,644 INFO [CommitterEvent Processor #4] 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing 
> the event EventType: JOB_COMMIT
> 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : 
> jvm_1482404625971_23910_m_60473139527690 asked for a task
> 2016-12-23 09:10:38,724 INFO [IPC Server handler 0 on 27113] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: 
> jvm_1482404625971_23910_m_60473139527690 is invalid and will be killed.
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for 
> JobFinishedEvent 
> 2016-12-23 09:10:38,797 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1482404625971_23910Job Transitioned from COMMITTING to SUCCEEDED
> 2016-12-23 09:10:38,798 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Job finished cleanly, 
> recording last MRAppMaster retry
> 2016-12-23 09:10:38,798 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator 
> isAMLastRetry: true
> 2016-12-23 09:10:38,798 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: RMCommunicator notified 
> that shouldUnregistered is: true
> 2016-12-23 09:10:38,799 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: 
> true
> 2016-12-23 09:10:38,799 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: 
> JobHistoryEventHandler notified that forceJobCompletion is true
> 2016-12-23 09:10:38,799 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the 
> services
> 2016-12-23 09:10:38,800 INFO [Thread-93] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping 
> JobHistoryEventHandler. Size of the outstanding queue size is 1
> 2016-12-23 09:10:38,989 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before 
> Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 
> AssignedReds:0 CompletedMaps:5 CompletedReds:0 ContAlloc:8 ContRel:0 
> HostLocal:0 RackLocal:0
> 2016-12-23 09:10:38,993 INFO [RMCom

[jira] [Comment Edited] (MAPREDUCE-7053) Timed out tasks can fail to produce thread dump

2018-05-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367455#comment-16367455
 ] 

Eric Payne edited comment on MAPREDUCE-7053 at 5/7/18 1:15 PM:
---

Thanks [~jlowe].

I committed MAPREDUCE-7053.001.patch to trunk, and cherry-picked to branch-3.1, 
branch-3.0, and -branch-3.0.1-.
I committed MAPREDUCE-7053-branch-2.001.patch to branch-2, branch-2.9, and 
branch-2.8.


was (Author: eepayne):
Thanks [~jlowe].

I committed MAPREDUCE-7053.001.patch to trunk, and cherry-picked to branch-3.1, 
branch-3.0, and branch-3.0.1.
I committed MAPREDUCE-7053-branch-2.001.patch branch-2, branch-2.9 and 
branch-2.8

> Timed out tasks can fail to produce thread dump
> ---
>
> Key: MAPREDUCE-7053
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7053
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 3.0.3
>
> Attachments: MAPREDUCE-7053-branch-2.001.patch, 
> MAPREDUCE-7053.001.patch
>
>
> TestMRJobs#testThreadDumpOnTaskTimeout has been failing sporadically 
> recently.  When the AM times out a task it immediately removes it from the 
> list of known tasks and then connects to the NM to request a thread dump 
> followed by a kill.  If the task heartbeats in after the task has been 
> removed from the list of known tasks but before the thread dump signal 
> arrives then the task can exit with a "org.apache.hadoop.mapred.Task: Parent 
> died." message and no thread dump.
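
The ordering problem described above can be made concrete with a small deterministic sketch. It is an illustration under stated assumptions, not Hadoop code: ToyAppMaster, ToyTask and RaceDemo are hypothetical names, and the model only captures the sequence from the description (the task is forgotten on timeout, a heartbeat from an unknown task makes it exit, and a thread dump signal that arrives afterwards finds nothing to dump).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the race; these are NOT Hadoop's classes.
final class ToyAppMaster {
    private final Set<String> knownTasks = new HashSet<>();

    void register(String taskId) {
        knownTasks.add(taskId);
    }

    // On timeout the task is forgotten immediately, before any signal is sent.
    void timeOut(String taskId) {
        knownTasks.remove(taskId);
    }

    // Heartbeat reply: false means the parent no longer knows this task.
    boolean heartbeat(String taskId) {
        return knownTasks.contains(taskId);
    }
}

final class ToyTask {
    private final String id;
    private boolean alive = true;
    private boolean dumped = false;

    ToyTask(String id) {
        this.id = id;
    }

    // A task that is told it is unknown exits (the "Parent died." case).
    void heartbeatTo(ToyAppMaster am) {
        if (alive && !am.heartbeat(id)) {
            alive = false;
        }
    }

    // The dump can only be produced while the task is still running.
    void onThreadDumpSignal() {
        if (alive) {
            dumped = true;
        }
    }

    boolean producedThreadDump() {
        return dumped;
    }
}

final class RaceDemo {
    // Runs the timeout sequence; the flag selects which side wins the race.
    static boolean run(boolean heartbeatBeforeSignal) {
        ToyAppMaster am = new ToyAppMaster();
        ToyTask task = new ToyTask("task_1");
        am.register("task_1");
        am.timeOut("task_1");
        if (heartbeatBeforeSignal) {
            task.heartbeatTo(am);
        }
        task.onThreadDumpSignal();
        return task.producedThreadDump();
    }

    public static void main(String[] args) {
        System.out.println("heartbeat first, dump produced: " + run(true));
        System.out.println("signal first, dump produced: " + run(false));
    }
}
```

In this model a thread dump is lost only when the heartbeat wins the race, which matches the sporadic nature of the test failure.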



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7053) Timed out tasks can fail to produce thread dump

2018-05-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465895#comment-16465895
 ] 

Eric Payne commented on MAPREDUCE-7053:
---

bq. Thanks for the work here. I noticed that you reverted it from 3.0.2, but 
per your comment above, it's in branch-3.0.1.
[~yzhangal], It was reverted from branch-3.0.1 as well. Sorry about the 
confusion.




