[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly
[ https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3549: - Attachment: TEZ-3549-branch-0.7.001.patch Thanks [~jeagles]. Uploading branch-0.7 version of the patch. > TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS > correctly > - > > Key: TEZ-3549 > URL: https://issues.apache.org/jira/browse/TEZ-3549 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Fix For: 0.7.2, 0.9.0, 0.8.5 > > Attachments: TEZ-3549-branch-0.7.001.patch, TEZ-3549.001.patch, > TEZ-3549.002.patch, TEZ-3549.003.patch > > > {code} > hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar orderedwordcount > -Dtez.task.progress.stuck.interval-ms=70 /input /output > {code} > With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}} > fails as follows: > {code} > INFO client.DAGClientImpl: DAG completed. FinalState=FAILED > INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, > vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, > taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, > info=[Attempt failed because it appears to make no progress for 70ms], > TaskAttempt 1 failed, info=[Attempt failed because it appears to make no > progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it > appears to make no progress for 70ms], TaskAttempt 3 failed, > info=[Attempt failed because it appears to make no progress for 70ms]], > Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, > Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due > to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, > vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in > RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, > failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] > killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, > vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex > received Kill while in RUNNING state., Vertex did not succeed due to > OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 > [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed > due to VERTEX_FAILURE. failedVertices:1 killedVertices:2] > {code} > This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 > until progress gets notified and the following code always evaluates to true > if this config is set/non-zero. > TaskUpdaterTransition: > {code} > if (statusEvent.getProgressNotified()) { > ta.lastNotifyProgressTimestamp = ta.clock.getTime(); > } else { > long currTime = ta.clock.getTime(); > if (ta.hungIntervalMax > 0 && > currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) { > {code} > Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so > that this comparison is valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly
[ https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3549: - Attachment: TEZ-3549.003.patch Thanks [~jeagles]. Updated patch to address the test failure. Due to this bug, the test was not really testing what it expected. The test-vertex's sleep value for the processor was not getting set properly and the fix started exercising code correctly causing failures. Request for review and comments post Precommit. Thanks a lot! > TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS > correctly > - > > Key: TEZ-3549 > URL: https://issues.apache.org/jira/browse/TEZ-3549 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3549.001.patch, TEZ-3549.002.patch, > TEZ-3549.003.patch > > > {code} > hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar orderedwordcount > -Dtez.task.progress.stuck.interval-ms=70 /input /output > {code} > With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}} > fails as follows: > {code} > INFO client.DAGClientImpl: DAG completed. FinalState=FAILED > INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, > vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, > taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, > info=[Attempt failed because it appears to make no progress for 70ms], > TaskAttempt 1 failed, info=[Attempt failed because it appears to make no > progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it > appears to make no progress for 70ms], TaskAttempt 3 failed, > info=[Attempt failed because it appears to make no progress for 70ms]], > Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, > Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due > to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, > vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in > RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, > failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] > killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, > vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex > received Kill while in RUNNING state., Vertex did not succeed due to > OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 > [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed > due to VERTEX_FAILURE. failedVertices:1 killedVertices:2] > {code} > This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 > until progress gets notified and the following code always evaluates to true > if this config is set/non-zero. > TaskUpdaterTransition: > {code} > if (statusEvent.getProgressNotified()) { > ta.lastNotifyProgressTimestamp = ta.clock.getTime(); > } else { > long currTime = ta.clock.getTime(); > if (ta.hungIntervalMax > 0 && > currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) { > {code} > Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so > that this comparison is valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly
[ https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3549: - Attachment: TEZ-3549.002.patch Last patch was not for master. Uploading the master version. > TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS > correctly > - > > Key: TEZ-3549 > URL: https://issues.apache.org/jira/browse/TEZ-3549 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3549.001.patch, TEZ-3549.002.patch > > > {code} > hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar orderedwordcount > -Dtez.task.progress.stuck.interval-ms=70 /input /output > {code} > With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}} > fails as follows: > {code} > INFO client.DAGClientImpl: DAG completed. FinalState=FAILED > INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, > vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, > taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, > info=[Attempt failed because it appears to make no progress for 70ms], > TaskAttempt 1 failed, info=[Attempt failed because it appears to make no > progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it > appears to make no progress for 70ms], TaskAttempt 3 failed, > info=[Attempt failed because it appears to make no progress for 70ms]], > Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, > Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due > to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, > vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in > RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, > failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] > killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, > vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex > received Kill while in RUNNING state., Vertex did not succeed due to > OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 > [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed > due to VERTEX_FAILURE. failedVertices:1 killedVertices:2] > {code} > This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 > until progress gets notified and the following code always evaluates to true > if this config is set/non-zero. > TaskUpdaterTransition: > {code} > if (statusEvent.getProgressNotified()) { > ta.lastNotifyProgressTimestamp = ta.clock.getTime(); > } else { > long currTime = ta.clock.getTime(); > if (ta.hungIntervalMax > 0 && > currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) { > {code} > Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so > that this comparison is valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly
[ https://issues.apache.org/jira/browse/TEZ-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3549: - Attachment: TEZ-3549.001.patch The patch sets the value of {{lastNotifyProgressTimestamp}} to current time during object creation. Includes a test. > TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS > correctly > - > > Key: TEZ-3549 > URL: https://issues.apache.org/jira/browse/TEZ-3549 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3549.001.patch > > > {code} > hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar orderedwordcount > -Dtez.task.progress.stuck.interval-ms=70 /input /output > {code} > With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=30}} > fails as follows: > {code} > INFO client.DAGClientImpl: DAG completed. FinalState=FAILED > INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, > vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, > taskId=task_123_456_1_00_07, diagnostics=[TaskAttempt 0 failed, > info=[Attempt failed because it appears to make no progress for 70ms], > TaskAttempt 1 failed, info=[Attempt failed because it appears to make no > progress for 70ms], TaskAttempt 2 failed, info=[Attempt failed because it > appears to make no progress for 70ms], TaskAttempt 3 failed, > info=[Attempt failed because it appears to make no progress for 70ms]], > Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, > Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due > to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, > vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in > RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, > failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] > killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, > vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex > received Kill while in RUNNING state., Vertex did not succeed due to > OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 > [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed > due to VERTEX_FAILURE. failedVertices:1 killedVertices:2] > {code} > This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 > until progress gets notified and the following code always evaluates to true > if this config is set/non-zero. > TaskUpdaterTransition: > {code} > if (statusEvent.getProgressNotified()) { > ta.lastNotifyProgressTimestamp = ta.clock.getTime(); > } else { > long currTime = ta.clock.getTime(); > if (ta.hungIntervalMax > 0 && > currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) { > {code} > Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so > that this comparison is valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)