[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly

2016-12-08 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated TEZ-3549:
-
Attachment: TEZ-3549-branch-0.7.001.patch

Thanks [~jeagles]. Uploading branch-0.7 version of the patch.

> TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS 
> correctly
> -
>
> Key: TEZ-3549
> URL: https://issues.apache.org/jira/browse/TEZ-3549
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Fix For: 0.7.2, 0.9.0, 0.8.5
>
> Attachments: TEZ-3549-branch-0.7.001.patch, TEZ-3549.001.patch, 
> TEZ-3549.002.patch, TEZ-3549.003.patch
>
>
> {code}
> hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar  orderedwordcount  
> -Dtez.task.progress.stuck.interval-ms=70 /input  /output
> {code}
> With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}}
> fails as follows:
> {code}
>  INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
>  INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, 
> vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, 
> taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms], 
> TaskAttempt 1 failed, info=[Attempt failed because it appears to make no 
> progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it 
> appears to make no progress for 70ms], TaskAttempt 3 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms]], 
> Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, 
> Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due 
> to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, 
> vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in 
> RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
> failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] 
> killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, 
> vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex 
> received Kill while in RUNNING state., Vertex did not succeed due to 
> OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 
> [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed 
> due to VERTEX_FAILURE. failedVertices:1 killedVertices:2]
> {code}
> This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 
> until progress gets notified and the following code always evaluates to true 
> if this config is set/non-zero.
> TaskUpdaterTransition:
> {code}
>   if (statusEvent.getProgressNotified()) {
> ta.lastNotifyProgressTimestamp = ta.clock.getTime();
>   } else {
> long currTime = ta.clock.getTime();
> if (ta.hungIntervalMax > 0 &&
> currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) {
> {code}
> Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so 
> that this comparison is valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly

2016-12-01 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated TEZ-3549:
-
Attachment: TEZ-3549.003.patch

Thanks [~jeagles]. Updated patch to address the test failure. Due to this bug, 
the test was not really testing what it expected. The test-vertex's sleep value 
for the processor was not getting set properly and the fix started exercising 
code correctly causing failures. Request for review and comments post 
Precommit. Thanks a lot!

> TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS 
> correctly
> -
>
> Key: TEZ-3549
> URL: https://issues.apache.org/jira/browse/TEZ-3549
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: TEZ-3549.001.patch, TEZ-3549.002.patch, 
> TEZ-3549.003.patch
>
>
> {code}
> hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar  orderedwordcount  
> -Dtez.task.progress.stuck.interval-ms=70 /input  /output
> {code}
> With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}}
> fails as follows:
> {code}
>  INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
>  INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, 
> vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, 
> taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms], 
> TaskAttempt 1 failed, info=[Attempt failed because it appears to make no 
> progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it 
> appears to make no progress for 70ms], TaskAttempt 3 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms]], 
> Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, 
> Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due 
> to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, 
> vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in 
> RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
> failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] 
> killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, 
> vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex 
> received Kill while in RUNNING state., Vertex did not succeed due to 
> OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 
> [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed 
> due to VERTEX_FAILURE. failedVertices:1 killedVertices:2]
> {code}
> This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 
> until progress gets notified and the following code always evaluates to true 
> if this config is set/non-zero.
> TaskUpdaterTransition:
> {code}
>   if (statusEvent.getProgressNotified()) {
> ta.lastNotifyProgressTimestamp = ta.clock.getTime();
>   } else {
> long currTime = ta.clock.getTime();
> if (ta.hungIntervalMax > 0 &&
> currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) {
> {code}
> Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so 
> that this comparison is valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly

2016-11-23 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated TEZ-3549:
-
Attachment: TEZ-3549.002.patch

Last patch was not for master. Uploading the master version.

> TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS 
> correctly
> -
>
> Key: TEZ-3549
> URL: https://issues.apache.org/jira/browse/TEZ-3549
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: TEZ-3549.001.patch, TEZ-3549.002.patch
>
>
> {code}
> hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar  orderedwordcount  
> -Dtez.task.progress.stuck.interval-ms=70 /input  /output
> {code}
> With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}}
> fails as follows:
> {code}
>  INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
>  INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, 
> vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, 
> taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms], 
> TaskAttempt 1 failed, info=[Attempt failed because it appears to make no 
> progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it 
> appears to make no progress for 70ms], TaskAttempt 3 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms]], 
> Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, 
> Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due 
> to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, 
> vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in 
> RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
> failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] 
> killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, 
> vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex 
> received Kill while in RUNNING state., Vertex did not succeed due to 
> OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 
> [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed 
> due to VERTEX_FAILURE. failedVertices:1 killedVertices:2]
> {code}
> This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 
> until progress gets notified and the following code always evaluates to true 
> if this config is set/non-zero.
> TaskUpdaterTransition:
> {code}
>   if (statusEvent.getProgressNotified()) {
> ta.lastNotifyProgressTimestamp = ta.clock.getTime();
>   } else {
> long currTime = ta.clock.getTime();
> if (ta.hungIntervalMax > 0 &&
> currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) {
> {code}
> Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so 
> that this comparison is valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly

2016-11-23 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated TEZ-3549:
-
Attachment: TEZ-3549.001.patch

The patch sets the value of {{lastNotifyProgressTimestamp}} to current time 
during object creation. Includes a test.

> TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS 
> correctly
> -
>
> Key: TEZ-3549
> URL: https://issues.apache.org/jira/browse/TEZ-3549
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: TEZ-3549.001.patch
>
>
> {code}
> hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar  orderedwordcount  
> -Dtez.task.progress.stuck.interval-ms=70 /input  /output
> {code}
> With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}}
> fails as follows:
> {code}
>  INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
>  INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, 
> vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, 
> taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms], 
> TaskAttempt 1 failed, info=[Attempt failed because it appears to make no 
> progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it 
> appears to make no progress for 70ms], TaskAttempt 3 failed, 
> info=[Attempt failed because it appears to make no progress for 70ms]], 
> Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, 
> Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due 
> to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, 
> vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in 
> RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
> failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] 
> killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, 
> vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex 
> received Kill while in RUNNING state., Vertex did not succeed due to 
> OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 
> [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed 
> due to VERTEX_FAILURE. failedVertices:1 killedVertices:2]
> {code}
> This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 
> until progress gets notified and the following code always evaluates to true 
> if this config is set/non-zero.
> TaskUpdaterTransition:
> {code}
>   if (statusEvent.getProgressNotified()) {
> ta.lastNotifyProgressTimestamp = ta.clock.getTime();
>   } else {
> long currTime = ta.clock.getTime();
> if (ta.hungIntervalMax > 0 &&
> currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) {
> {code}
> Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so 
> that this comparison is valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)