[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369254#comment-15369254 ] TezQA commented on TEZ-1019: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch against master revision 608e15e. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1839//console This message is automatically generated. > Re-factor routing of events to use common code path for normal and recovery > flow. > - > > Key: TEZ-1019 > URL: https://issues.apache.org/jira/browse/TEZ-1019 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, TEZ-1019-4.patch, > TEZ-1019-5.patch, Tez-1019.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233268#comment-15233268 ] TezQA commented on TEZ-1019: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch against master revision 53981d4. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1644//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090041#comment-15090041 ] TezQA commented on TEZ-1019: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch against master revision 85637c6. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1414//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715242#comment-14715242 ] TezQA commented on TEZ-1019: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch against master revision eb70cb7. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1030//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514826#comment-14514826 ] TezQA commented on TEZ-1019: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch against master revision 21d4e2d. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/551//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338228#comment-14338228 ] Jeff Zhang commented on TEZ-1019: - [~hitesh] Thanks for the review; I have attached a new patch to address the review comments.

bq. in restoreFromEvent, the code goes through manually defined paths instead of using existing transition functions resulting in duplication of logic.

This is limited by the current recovery process. Currently, we use the flow below to recover:

DAG::restoreFromEvent -> Vertex::restoreFromEvent -> Task::restoreFromEvent -> TaskAttempt::restoreFromEvent -> DAG::RecoveryTransition -> Vertex::RecoveryTransition -> Task::RecoveryTransition -> TaskAttempt::RecoveryTransition

So we have to manually call some functions in Vertex::restoreFromEvent to create tasks; otherwise Task::restoreFromEvent will throw an NPE because the task has not been created.

In theory, I think it is possible to completely align the recovery transition and the normal transition. For this, we need to refactor the current recovery process; TEZ-1657 is for this. We can first consolidate all the recovery logs into DagRecoveryData and use this data to recover the DAG. The DAG will then follow the normal state machine transitions; when it needs to recover its vertices, we just need to extract VertexRecoveryData from the DagRecoveryData and use it to recover the vertices. The same goes for the task and task attempt:

DAG::RecoveryTransition -> Vertex::RecoveryTransition -> Task::RecoveryTransition -> TaskAttempt::RecoveryTransition

But this change is too big, so I think we can put it in another jira.
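The consolidation idea in the comment above (gather all recovery logs into one DagRecoveryData, then hand each vertex its own VertexRecoveryData) can be illustrated with a minimal sketch. Only the names DagRecoveryData and VertexRecoveryData come from the comment; LogEntry, TaskRecoveryData, the string event names, and every method below are hypothetical and are not taken from the Tez code base or from any TEZ-1019 patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical flat log entry; vertexId/taskId are null for higher-level events. */
final class LogEntry {
    final String vertexId;
    final String taskId;
    final String event;

    LogEntry(String vertexId, String taskId, String event) {
        this.vertexId = vertexId;
        this.taskId = taskId;
        this.event = event;
    }
}

/** Hypothetical per-task slice of the recovery log. */
final class TaskRecoveryData {
    final List<String> events = new ArrayList<>();
}

/** Assumed shape: vertex-level events plus one slice per task. */
final class VertexRecoveryData {
    final List<String> events = new ArrayList<>();
    private final Map<String, TaskRecoveryData> tasks = new LinkedHashMap<>();

    TaskRecoveryData task(String taskId) {
        return tasks.computeIfAbsent(taskId, id -> new TaskRecoveryData());
    }

    Map<String, TaskRecoveryData> tasks() {
        return tasks;
    }
}

/** Assumed shape: DAG-level events plus one slice per vertex. */
final class DagRecoveryData {
    final List<String> events = new ArrayList<>();
    private final Map<String, VertexRecoveryData> vertices = new LinkedHashMap<>();

    VertexRecoveryData vertex(String vertexId) {
        return vertices.computeIfAbsent(vertexId, id -> new VertexRecoveryData());
    }

    /** Consolidates a flat recovery log into the DAG -> vertex -> task hierarchy. */
    static DagRecoveryData fromLog(List<LogEntry> log) {
        DagRecoveryData dag = new DagRecoveryData();
        for (LogEntry e : log) {
            if (e.vertexId == null) {
                dag.events.add(e.event);
            } else if (e.taskId == null) {
                dag.vertex(e.vertexId).events.add(e.event);
            } else {
                dag.vertex(e.vertexId).task(e.taskId).events.add(e.event);
            }
        }
        return dag;
    }
}

final class RecoveryDataDemo {
    static DagRecoveryData sample() {
        return DagRecoveryData.fromLog(Arrays.asList(
                new LogEntry(null, null, "DAGInitialized"),
                new LogEntry("v1", null, "VertexInitialized"),
                new LogEntry("v1", "t0", "TaskStarted"),
                new LogEntry("v1", "t0", "TaskFinished"),
                new LogEntry("v2", null, "VertexInitialized")));
    }

    public static void main(String[] args) {
        DagRecoveryData dag = sample();
        System.out.println("dag events: " + dag.events);
        System.out.println("v1 events: " + dag.vertex("v1").events);
        System.out.println("v1/t0 events: " + dag.vertex("v1").task("t0").events);
    }
}
```

With this layout each level only sees its own slice, so a vertex could in principle be recovered from its VertexRecoveryData without the DAG calling into task-level restore code first.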
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337350#comment-14337350 ] Hitesh Shah commented on TEZ-1019: -- Sorry for the delay in the review. I still need to do some more manual testing on this. Some general comments:
- routeRecoveredEvents still exists and is part of the recovery flow and needs to be kept in sync with the normal event flow.
- in restoreFromEvent, the code goes through manually defined paths instead of using existing transition functions resulting in duplication of logic.
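The first review point (a separate routeRecoveredEvents that has to be kept in sync with the normal flow) is the kind of duplication a single routing method avoids. The sketch below is illustrative only: EventRouter, RoutedEvent, the journal list, and the rule that replayed events are not journaled again are all assumptions made for the example, not the actual Tez routing code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event addressed to one task of a vertex. */
final class RoutedEvent {
    final int targetTaskIndex;
    final String payload;

    RoutedEvent(int targetTaskIndex, String payload) {
        this.targetTaskIndex = targetTaskIndex;
        this.payload = payload;
    }
}

/** Hypothetical router with one code path for both normal and recovered events. */
final class EventRouter {
    final List<String> delivered = new ArrayList<>();
    final List<RoutedEvent> journal = new ArrayList<>();
    private final int numTasks;

    EventRouter(int numTasks) {
        this.numTasks = numTasks;
    }

    /** The only place where routing decisions are made. */
    void route(RoutedEvent event, boolean recovered) {
        if (event.targetTaskIndex < 0 || event.targetTaskIndex >= numTasks) {
            throw new IllegalArgumentException("no such task: " + event.targetTaskIndex);
        }
        delivered.add("task" + event.targetTaskIndex + "<-" + event.payload);
        // Assumption for this sketch: recovered events are already in the log,
        // so they are not journaled a second time.
        if (!recovered) {
            journal.add(event);
        }
    }

    /** Recovery entry point: a plain loop over the shared route method. */
    void replay(List<RoutedEvent> recoveredEvents) {
        for (RoutedEvent e : recoveredEvents) {
            route(e, true);
        }
    }
}

final class RoutingDemo {
    public static void main(String[] args) {
        EventRouter normal = new EventRouter(2);
        normal.route(new RoutedEvent(0, "dataMovement"), false);
        normal.route(new RoutedEvent(1, "inputFailed"), false);

        EventRouter recovering = new EventRouter(2);
        recovering.replay(normal.journal);
        System.out.println(normal.delivered.equals(recovering.delivered));
    }
}
```

Because replay has no routing logic of its own, a change to route applies to both flows automatically.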
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313958#comment-14313958 ] Hadoop QA commented on TEZ-1019: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch against master revision 12c31ab. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/153//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/153//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313885#comment-14313885 ] Jeff Zhang commented on TEZ-1019: - This patch only partially resolves the refactoring of the common code path for both normal and recovery flow. The changes are mainly in RecoveryTransition; the restoreFromEvent method still doesn't follow the state machine transitions. For TEZ-2006, this patch should be sufficient. We just need to change the returned state to INITIALING / VM_IN_INITIALING in VertexImpl.InitTransition when it is in recovery and a VertexInitializedEvent is seen.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313202#comment-14313202 ] Bikas Saha commented on TEZ-1019: - Folks, TEZ-2066 depends on this jira because, for that to be implemented, VertexImpl needs to go through state transitions like normal when executing recovery. I expect this jira to fix that, so I have marked TEZ-2066 as blocked by this one. If this is not the right jira then please link the correct jira to TEZ-2066. Thanks!
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296436#comment-14296436 ] Hadoop QA commented on TEZ-1019: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12695190/TEZ-1019-4.patch against master revision e84c1aa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/89//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/89//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294950#comment-14294950 ] Hadoop QA commented on TEZ-1019: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12694948/TEZ-1019-3.patch against master revision 1e680a5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/83//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/83//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/83//console This message is automatically generated.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294386#comment-14294386 ] Hitesh Shah commented on TEZ-1019: -- bq. Is there any real case in Pig/Hive that VM would set parallelism to 0 ? Yes - if the data turns out to be 0 in size or the initializer realized that there is no data worth reading.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252782#comment-14252782 ] Jeff Zhang commented on TEZ-1019: -

bq. There is no guarantee that the vertex running event was written in time (given that it is not critical), hence the vertex start could have occurred, as well as tasks starting/finishing.

Yes, I know it may not be written in time. But if the recoveredState is INITED, that means the VertexStartedEvent and the task-related events were not logged either, so we have no tasks to recover in this case.

bq. That should be the case in most scenarios. However, with allowing of -1 on 1:1 edges and waiting for an upstream parallelism to be set to define the downstream vertex parallelism, we may need to verify all such cases. Also, in case of a parallelism update ( after running ), numTasks need not be set to 0 but this could just be a sanity check to verify the tasks array matches numTasks.

Why do we allow a vertex to go to the RUNNING state with taskNum set to -1? There is no benefit to that, since we still cannot start any tasks when taskNum is -1.

bq. numTasks 0 means vertex should go to a succeeded state. this might also happen if the vertex manager sets parallelism to 0

Is there any real case in Pig/Hive that VM would set parallelism to 0 ?
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252654#comment-14252654 ] Hitesh Shah commented on TEZ-1019: -- bq. Regarding the succeeded case - numTasks 0 means vertex should go to a succeeded state - this might also happen if the vertex manager sets parallelism to 0
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252652#comment-14252652 ] Hitesh Shah commented on TEZ-1019: --

bq. In the existing code, we will recover task when vertex's recovered state is inited, not sure why, I just remove it in the new patch. As my understanding, if it is in INITED, there should be no task running, we don't need to recover task here.

There is no guarantee that the vertex running event was written in time (given that it is not critical), hence the vertex start could have occurred, as well as tasks starting/finishing.

bq. when vertex's recoveredState is RUNNING, we will still check the numTasks. As my understanding, numTasks wouldn't be 0 when it is in RUNNING, otherwise that means init is not completed.

That should be the case in most scenarios. However, since we allow -1 on 1:1 edges and wait for an upstream parallelism to be set to define the downstream vertex parallelism, we may need to verify all such cases. Also, in case of a parallelism update (after running), numTasks need not be set to 0, but this could just be a sanity check to verify that the tasks array matches numTasks.
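The numTasks cases discussed in this comment and the replies to it (0 tasks, an undetermined -1 on 1:1 edges, and the sanity check that the tasks collection matches numTasks) can be laid out as a small decision function. This is one possible reading of the discussion, not the behaviour of VertexImpl: the enum, the class, and in particular the choice to keep a vertex with numTasks == -1 in an initializing state are assumptions made for this sketch, since the thread leaves that question open.

```java
/** Hypothetical outcome of recovering a vertex whose recovered state was RUNNING. */
enum RecoveredVertexOutcome { INITIALIZING, RUNNING, SUCCEEDED }

final class VertexRecoveryCheck {
    private VertexRecoveryCheck() {
    }

    /**
     * @param numTasks     recovered parallelism; -1 means "not yet determined"
     * @param createdTasks number of task objects actually created for the vertex
     */
    static RecoveredVertexOutcome decide(int numTasks, int createdTasks) {
        if (numTasks < -1) {
            throw new IllegalArgumentException("invalid numTasks: " + numTasks);
        }
        if (numTasks == -1) {
            // Assumption: parallelism still depends on an upstream vertex,
            // so no task can be started yet.
            return RecoveredVertexOutcome.INITIALIZING;
        }
        // Sanity check from the discussion: the tasks collection must match numTasks.
        if (createdTasks != numTasks) {
            throw new IllegalStateException(
                    "tasks=" + createdTasks + " does not match numTasks=" + numTasks);
        }
        // Zero tasks (e.g. the vertex manager set parallelism to 0) means there is
        // nothing to run, so the vertex is treated as succeeded.
        return numTasks == 0 ? RecoveredVertexOutcome.SUCCEEDED : RecoveredVertexOutcome.RUNNING;
    }
}

final class VertexRecoveryCheckDemo {
    public static void main(String[] args) {
        System.out.println(VertexRecoveryCheck.decide(0, 0));
        System.out.println(VertexRecoveryCheck.decide(3, 3));
        System.out.println(VertexRecoveryCheck.decide(-1, 0));
    }
}
```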
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237608#comment-14237608 ] Jeff Zhang commented on TEZ-1019: - Uploaded a new patch, [~hitesh] please help review it.
* The new patch changes a lot in the recovery of vertices. I removed the RECOVERING state and trigger the recovery from the root vertex. A downstream vertex should be able to start its own recovery automatically with the events from upstream, as in the normal flow. I moved the recovery work into the normal transitions (mainly InitTransition & StartTransition). I just treat the recovery events as redo logs and use them to init and start the vertex.
* I have only made it pass TestAMRecovery and manually tested some examples in tez-examples. (TestVertexRecovery doesn't pass now; please just help review whether this approach works and whether I missed some cases.)
* Besides this, I have 2 questions about the vertex recovery:
** In the existing code, we recover tasks when the vertex's recovered state is INITED. I am not sure why, so I just removed it in the new patch.
** When the vertex's recoveredState is RUNNING, we still check numTasks. As I understand it, numTasks wouldn't be 0 when it is RUNNING; otherwise that means init is not completed.
{code}
assert vertex.tasks.size() == vertex.numTasks;
if (vertex.tasks != null && vertex.numTasks != 0) {
  for (Task task : vertex.tasks.values()) {
    vertex.eventHandler.handle(
        new TaskEventRecoverTask(task.getTaskId()));
  }
  try {
    vertex.recoveryCodeSimulatingStart();
    endState = VertexState.RUNNING;
  } catch (AMUserCodeException e) {
    String msg = "Exception in " + e.getSource() + ", vertex:" + vertex.getLogIdentifier();
    LOG.error(msg, e);
    vertex.finished(VertexState.FAILED, VertexTerminationCause.AM_USERCODE_FAILURE,
        msg + ", " + ExceptionUtils.getStackTrace(e.getCause()));
    endState = VertexState.FAILED;
  }
} else {
  // why succeeded here
  endState = VertexState.SUCCEEDED;
  vertex.finished(endState);
}
{code}
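The redo-log approach described in this comment (no separate RECOVERING state; the normal init and start transitions consume recovered events when they exist) can be sketched in miniature. RedoLogVertex, its three states, and the "INITED:<numTasks>" / "STARTED" string encoding are hypothetical stand-ins chosen for the example; the real patch works on VertexImpl and Tez history events, which are not reproduced here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical vertex whose normal transitions double as recovery transitions. */
final class RedoLogVertex {
    enum State { NEW, INITED, RUNNING }

    private static final String INITED_PREFIX = "INITED:";
    private static final String STARTED = "STARTED";

    private final Deque<String> redoLog; // recovered events, oldest first
    final List<String> trace = new ArrayList<>();
    private State state = State.NEW;
    private int numTasks = -1;

    RedoLogVertex(List<String> recoveredEvents) {
        this.redoLog = new ArrayDeque<>(recoveredEvents);
    }

    /** Init transition: reuse the logged result if present, else compute it. */
    void init(int computedNumTasks) {
        require(State.NEW);
        String head = redoLog.peek();
        if (head != null && head.startsWith(INITED_PREFIX)) {
            redoLog.poll();
            numTasks = Integer.parseInt(head.substring(INITED_PREFIX.length()));
            trace.add("init:recovered");
        } else {
            numTasks = computedNumTasks;
            trace.add("init:fresh");
        }
        state = State.INITED;
    }

    /** Start transition: same method whether or not a start was logged before. */
    void start() {
        require(State.INITED);
        if (STARTED.equals(redoLog.peek())) {
            redoLog.poll();
            trace.add("start:recovered");
        } else {
            trace.add("start:fresh");
        }
        state = State.RUNNING;
    }

    private void require(State expected) {
        if (state != expected) {
            throw new IllegalStateException("expected " + expected + " but was " + state);
        }
    }

    State state() {
        return state;
    }

    int numTasks() {
        return numTasks;
    }
}

final class RedoLogDemo {
    public static void main(String[] args) {
        // The previous attempt logged initialization (4 tasks) but not the start.
        RedoLogVertex vertex = new RedoLogVertex(Arrays.asList("INITED:4"));
        vertex.init(8);
        vertex.start();
        System.out.println(vertex.state() + " numTasks=" + vertex.numTasks() + " " + vertex.trace);
    }
}
```

In this toy version a vertex with an empty redo log behaves exactly like a first-time run, which is the property the comment is after: one transition path, with the log only changing where the inputs come from.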
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164635#comment-14164635 ] Bikas Saha commented on TEZ-1019: - bq. Do you mean reuse the state machines transition code when recovering ? Have investigated this before, need more time to get the code clean. Yes.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164442#comment-14164442 ] Jeff Zhang commented on TEZ-1019: - [~bikassaha], I attached a simple patch that only uses the common code for routing events. bq. This would mean recovery and normal mode take the state machines through the necessary transitions. Do you mean reuse the state machines transition code when recovering ? Have investigated this before, need more time to get the code clean.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164420#comment-14164420 ] Bikas Saha commented on TEZ-1019: - This would mean recovery and normal mode take the state machines through the necessary transitions.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127908#comment-14127908 ] Jeff Zhang commented on TEZ-1019: - [~bikassaha] Agreed that delaying it will make it harder to clean up over time. I will work on this in the next 1 or 2 weeks.
[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.
[ https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127888#comment-14127888 ] Bikas Saha commented on TEZ-1019: - [~hitesh] [~zjffdu] Any opinions on the priority of this jira wrt other advanced stuff/testing being done wrt recovery? Testing may not be affected if it considers only the external visible effects of recovery but adding more features to Tez may mean getting this cleaned up will be harder with time. As more events get added then maintaining this or adding new events will keep getting harder.