[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2016-07-09 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369254#comment-15369254
 ] 

TezQA commented on TEZ-1019:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
  against master revision 608e15e.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1839//console

This message is automatically generated.

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, TEZ-1019-4.patch, 
> TEZ-1019-5.patch, Tez-1019.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2016-04-08 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233268#comment-15233268
 ] 

TezQA commented on TEZ-1019:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
  against master revision 53981d4.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1644//console

This message is automatically generated.



[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2016-01-08 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090041#comment-15090041
 ] 

TezQA commented on TEZ-1019:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
  against master revision 85637c6.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1414//console

This message is automatically generated.



[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-08-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715242#comment-14715242
 ] 

TezQA commented on TEZ-1019:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
  against master revision eb70cb7.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1030//console

This message is automatically generated.



[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-04-27 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514826#comment-14514826
 ] 

TezQA commented on TEZ-1019:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
  against master revision 21d4e2d.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/551//console

This message is automatically generated.



[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-02-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338228#comment-14338228
 ] 

Jeff Zhang commented on TEZ-1019:
-

[~hitesh] Thanks for the review. I have attached a new patch that addresses the 
review comments.

bq. in restoreFromEvent, the code goes through manually defined paths instead 
of using existing transition functions resulting in duplication of logic.
This is a limitation of the current recovery process. Currently, we use the 
flow below to recover:
DAG::restoreFromEvent -> Vertex::restoreFromEvent -> Task::restoreFromEvent -> 
TaskAttempt::restoreFromEvent -> DAG::RecoveryTransition -> 
Vertex::RecoveryTransition -> Task::RecoveryTransition -> 
TaskAttempt::RecoveryTransition 
So we have to manually call some functions in Vertex::restoreFromEvent to create 
the tasks; otherwise Task::restoreFromEvent will throw an NPE because the task 
has not been created.
In theory, I think it is possible to completely align the recovery transition 
with the normal transition. For this, we need to refactor the current recovery 
process. TEZ-1657 is for this.
We can first consolidate all the recovery logs into DagRecoveryData and use this 
data to recover the DAG. The DAG will then follow the normal state machine 
transitions; when it needs to recover its vertices, we just need to extract 
VertexRecoveryData from the DagRecoveryData and use it to recover the vertices. 
The same applies to the task and the task attempt.
DAG::RecoveryTransition -> Vertex::RecoveryTransition -> 
Task::RecoveryTransition -> TaskAttempt::RecoveryTransition

But this change is too big, so I think we can put it in another jira. 
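
A minimal Java sketch of the consolidated-recovery-data idea described above. DagRecoveryData and VertexRecoveryData are the names proposed in this comment, but their fields and methods, the RecoveryDataSketch wrapper and the stateToResume helper are hypothetical and are not taken from the Tez code base.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: these are not the real Tez classes.
class RecoveryDataSketch {

  /** Recovered facts about one vertex (illustrative fields). */
  static final class VertexRecoveryData {
    final String vertexName;
    final String lastLoggedState; // e.g. "INITED" or "RUNNING"
    final int numTasks;

    VertexRecoveryData(String vertexName, String lastLoggedState, int numTasks) {
      this.vertexName = vertexName;
      this.lastLoggedState = lastLoggedState;
      this.numTasks = numTasks;
    }
  }

  /** All recovery logs of one DAG, consolidated before recovery starts. */
  static final class DagRecoveryData {
    private final Map<String, VertexRecoveryData> vertices = new LinkedHashMap<>();

    void addVertex(VertexRecoveryData data) {
      vertices.put(data.vertexName, data);
    }

    /** Returns the data for a vertex, or null if nothing was logged for it. */
    VertexRecoveryData getVertexRecoveryData(String vertexName) {
      return vertices.get(vertexName);
    }
  }

  /** Picks the state a vertex resumes in; no logged data means a fresh start. */
  static String stateToResume(DagRecoveryData dagData, String vertexName) {
    VertexRecoveryData data = dagData.getVertexRecoveryData(vertexName);
    return data == null ? "NEW" : data.lastLoggedState;
  }

  public static void main(String[] args) {
    DagRecoveryData dagData = new DagRecoveryData();
    dagData.addVertex(new VertexRecoveryData("v1", "RUNNING", 4));
    System.out.println(stateToResume(dagData, "v1"));
    System.out.println(stateToResume(dagData, "v2"));
  }
}
```

The point of the sketch is that each level only asks its parent's recovery data for its own slice, so the vertex, task and task attempt can each be driven by their normal transitions.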






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-02-25 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337350#comment-14337350
 ] 

Hitesh Shah commented on TEZ-1019:
--

Sorry for the delay in the review. I still need to do some more manual testing 
on this. 

Some general comments: 
   - routeRecoveredEvents still exists and is part of the recovery flow, so it 
needs to be kept in sync with the normal event flow. 
   - in restoreFromEvent, the code goes through manually defined paths instead 
of using existing transition functions, resulting in duplication of logic. 




[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-02-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313958#comment-14313958
 ] 

Hadoop QA commented on TEZ-1019:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12697692/TEZ-1019-5.patch
  against master revision 12c31ab.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/153//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/153//console

This message is automatically generated.



[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-02-10 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313885#comment-14313885
 ] 

Jeff Zhang commented on TEZ-1019:
-

This patch only partially resolves the refactoring of the common code path for 
both normal and recovery flow. The changes are mainly in RecoveryTransition; the 
method restoreFromEvent still doesn't follow the state machine transitions.
For TEZ-2006, this patch should be sufficient. We just need to change the 
returned state to INITIALIZING / VM_IN_INITIALIZING in VertexImpl.InitTransition 
when it is in recovery and a VertexInitializedEvent is seen.



[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-02-09 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313202#comment-14313202
 ] 

Bikas Saha commented on TEZ-1019:
-

Folks, TEZ-2066 depends on this jira because, for that to be implemented, 
VertexImpl needs to go through state transitions as normal when executing 
recovery. I expect this jira to fix that, so I have marked TEZ-2066 as blocked 
by this one. If this is not the right jira then please link the correct jira to 
TEZ-2066. Thanks!

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, TEZ-1019-4.patch, 
> Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-01-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296436#comment-14296436
 ] 

Hadoop QA commented on TEZ-1019:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12695190/TEZ-1019-4.patch
  against master revision e84c1aa.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/89//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/89//console

This message is automatically generated.

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, TEZ-1019-4.patch, 
> Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-01-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294950#comment-14294950
 ] 

Hadoop QA commented on TEZ-1019:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12694948/TEZ-1019-3.patch
  against master revision 1e680a5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/83//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/83//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/83//console

This message is automatically generated.

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, TEZ-1019-3.patch, Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2015-01-27 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294386#comment-14294386
 ] 

Hitesh Shah commented on TEZ-1019:
--

bq. Is there any real case in Pig/Hive that VM would set parallelism to 0 ?

Yes - if the data turns out to be 0 in size or the initializer realizes that 
there is no data worth reading. 

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-12-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252782#comment-14252782
 ] 

Jeff Zhang commented on TEZ-1019:
-

bq. There is no guarantee that vertex running event was written in time ( given 
that it is not critical ) hence both the vertex start could have occurred as 
well tasks starting/finishing.
Yes, I know it may not be written in time. But if the recoveredState is INITED, 
that means the VertexStartedEvent and the task-related events were not logged 
either. So we have no tasks to recover in this case.

bq. That should be the case in most scenarios. However, with allowing of -1 on 
1:1 edges and waiting for an upstream parallelism to be set to define the 
downstream vertex parallelism, we may need to verify all such cases. Also, in 
case of a parallelism update ( after running ), numTasks need not be set to 0 
but this could just be a sanity check to verify the tasks array matches 
numTasks.
Why do we allow a vertex to go to the RUNNING state with numTasks set to -1? 
There is no benefit in that, since we still cannot start any tasks while 
numTasks is -1.

bq. numTasks 0 means vertex should go to a succeeded state. this might also 
happen if the vertex manager sets parallelism to 0
Is there any real case in Pig/Hive that VM would set parallelism to 0 ?




> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-12-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252654#comment-14252654
 ] 

Hitesh Shah commented on TEZ-1019:
--

bq. Regarding the succeeded case
  - numTasks 0 means vertex should go to a succeeded state
  - this might also happen if the vertex manager sets parallelism to 0
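
The rule in the two bullets above can be sketched in Java as follows. The class, enum and method names are hypothetical and do not come from the Tez source; the sketch also deliberately leaves the -1 (parallelism not yet decided) case out of scope and rejects it, which is an assumption of this example and not a statement about how VertexImpl behaves.

```java
// Hypothetical sketch; ZeroTaskVertexSketch, EndState and onStart are illustrative names.
class ZeroTaskVertexSketch {

  enum EndState { RUNNING, SUCCEEDED }

  /** Decides where a vertex goes once its parallelism (numTasks) is known. */
  static EndState onStart(int numTasks) {
    if (numTasks < 0) {
      // Undecided parallelism is out of scope for this sketch.
      throw new IllegalArgumentException("numTasks not decided yet: " + numTasks);
    }
    if (numTasks == 0) {
      // Nothing to run, for example because the vertex manager set
      // parallelism to 0, so the vertex is trivially complete.
      return EndState.SUCCEEDED;
    }
    return EndState.RUNNING;
  }

  public static void main(String[] args) {
    System.out.println(onStart(0));
    System.out.println(onStart(3));
  }
}
```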

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-12-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252652#comment-14252652
 ] 

Hitesh Shah commented on TEZ-1019:
--

bq. In the existing code, we will recover task when vertex's recovered state is 
inited, not sure why, I just remove it in the new patch. As my understanding, 
if it is in INITED, there should be no task running, we don't need to recover 
task here. 
There is no guarantee that the vertex running event was written in time (given 
that it is not critical), hence the vertex start could have occurred, as well 
as tasks starting/finishing. 

bq. when vertex's recoveredState is RUNNING, we will still check the numTasks. 
As my understanding, numTasks wouldn't been 0 when it is in RUNNING, otherwise 
that means init is not completed.
That should be the case in most scenarios. However, since -1 is allowed on 1:1 
edges, where the downstream vertex waits for the upstream parallelism to be set 
before defining its own, we may need to verify all such cases. Also, in 
case of a parallelism update (after running), numTasks need not be set to 0, 
but this could just be a sanity check to verify that the tasks array matches 
numTasks.



> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Jeff Zhang
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-12-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237608#comment-14237608
 ] 

Jeff Zhang commented on TEZ-1019:
-

Uploaded a new patch; [~hitesh], please help review it.


* The new patch changes a lot in the recovery of vertices. I removed the 
RECOVERING state and now trigger the recovery from the root vertex. A downstream 
vertex should be able to start its own recovery automatically with the events 
from upstream, as in the normal flow. I moved the recovery work into the normal 
transitions (mainly InitTransition & StartTransition). I treat the recovery 
events as redo logs and use them to init and start the vertex.
* So far it only passes TestAMRecovery, and I manually tested some examples in 
tez-examples. (TestVertexRecovery doesn't pass yet; please just help review 
whether this approach works and whether I have missed any cases.)
* Besides this, I have 2 questions about the vertex recovery:
** In the existing code, we recover tasks when the vertex's recovered state is 
INITED. I am not sure why, so I removed that in the new patch.
** When the vertex's recoveredState is RUNNING, we still check numTasks. 
As I understand it, numTasks cannot be 0 when the vertex is RUNNING; otherwise 
init has not completed.

{code}
  assert vertex.tasks.size() == vertex.numTasks;
  if (vertex.tasks != null && vertex.numTasks != 0) {
    for (Task task : vertex.tasks.values()) {
      vertex.eventHandler.handle(
          new TaskEventRecoverTask(task.getTaskId()));
    }
    try {
      vertex.recoveryCodeSimulatingStart();
      endState = VertexState.RUNNING;
    } catch (AMUserCodeException e) {
      String msg = "Exception in " + e.getSource() + ", vertex:" +
          vertex.getLogIdentifier();
      LOG.error(msg, e);
      vertex.finished(VertexState.FAILED,
          VertexTerminationCause.AM_USERCODE_FAILURE,
          msg + ", " + ExceptionUtils.getStackTrace(e.getCause()));
      endState = VertexState.FAILED;
    }
  } else {
    // why succeeded here
    endState = VertexState.SUCCEEDED;
    vertex.finished(endState);
  }
{code}
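
The redo-log approach from the first bullet above can be sketched in Java as below: recovered events are replayed through the same transition method that live events use, so there is only one code path. All names here (RedoLogReplaySketch, State, Event, transition, recover) are hypothetical, and the state machine is far simpler than the real VertexImpl one.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the Tez vertex state machine.
class RedoLogReplaySketch {

  enum State { NEW, INITED, RUNNING }

  enum Event { INIT, START }

  private State state = State.NEW;

  /** The single transition path shared by the normal and the recovery flow. */
  void transition(Event event) {
    switch (event) {
      case INIT:
        if (state == State.NEW) {
          state = State.INITED;
        }
        break;
      case START:
        // START is ignored unless the vertex has been initialized.
        if (state == State.INITED) {
          state = State.RUNNING;
        }
        break;
      default:
        throw new IllegalStateException("Unhandled event: " + event);
    }
  }

  State getState() {
    return state;
  }

  /** Recovery replays the logged events, in order, through transition(). */
  static RedoLogReplaySketch recover(List<Event> redoLog) {
    RedoLogReplaySketch vertex = new RedoLogReplaySketch();
    for (Event event : redoLog) {
      vertex.transition(event);
    }
    return vertex;
  }

  public static void main(String[] args) {
    List<Event> redoLog = Arrays.asList(Event.INIT, Event.START);
    System.out.println(recover(redoLog).getState());
  }
}
```

Because recover() only feeds events into transition(), any event added to the normal flow is handled during recovery without a second routing method to keep in sync.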

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-10-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164635#comment-14164635
 ] 

Bikas Saha commented on TEZ-1019:
-

bq. Do you mean reuse the state machines transition code when recovering ? Have 
investigated this before, need more time to get the code clean.
Yes.

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
> Attachments: Tez-1019.patch
>
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-10-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164442#comment-14164442
 ] 

Jeff Zhang commented on TEZ-1019:
-

[~bikassaha], I have attached a simple patch that only uses the common code for 
routing events.

bq. This would mean recovery and normal mode take the state machines through 
the necessary transitions.
Do you mean reusing the state machine transition code when recovering? I have 
investigated this before, but I need more time to get the code clean. 

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
> Attachments: Tez-1019.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-10-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164420#comment-14164420
 ] 

Bikas Saha commented on TEZ-1019:
-

This would mean recovery and normal mode take the state machines through the 
necessary transitions.

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-09-09 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127908#comment-14127908
 ] 

Jeff Zhang commented on TEZ-1019:
-

[~bikassaha] Agreed that delaying it will make it harder to clean up over 
time. I will work on this in the next 1 or 2 weeks. 

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>






[jira] [Commented] (TEZ-1019) Re-factor routing of events to use common code path for normal and recovery flow.

2014-09-09 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127888#comment-14127888
 ] 

Bikas Saha commented on TEZ-1019:
-

[~hitesh] [~zjffdu] Any opinions on the priority of this jira wrt other 
advanced stuff/testing being done wrt recovery? Testing may not be affected if 
it considers only the external visible effects of recovery but adding more 
features to Tez may mean getting this cleaned up will be harder with time. As 
more events get added then maintaining this or adding new events will keep 
getting harder.

> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> -
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>



