[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674696#comment-15674696 ]

Rohini Palaniswamy commented on TEZ-391:
----------------------------------------

[~bikassaha], is it possible to make time to get this into Tez 0.9 as well?

> SharedEdge - Support for passing same output from a vertex as input to two different vertices
> ---------------------------------------------------------------------------------------------
>
>                 Key: TEZ-391
>                 URL: https://issues.apache.org/jira/browse/TEZ-391
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rohini Palaniswamy
>            Assignee: Jeff Zhang
>         Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch, TEZ-391-WIP-4.patch, TEZ-391-WIP-5.patch, TEZ-391-WIP-6.patch, TEZ-391-WIP-7.patch
>
>
> We need this for a lot of use cases: for cases where multi-query is turned off, and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and we write the output multiple times.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369253#comment-15369253 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12734397/TEZ-391-WIP-7.patch
  against master revision 608e15e.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1838//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233267#comment-15233267 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12734397/TEZ-391-WIP-7.patch
  against master revision 53981d4.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1643//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090040#comment-15090040 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12734397/TEZ-391-WIP-7.patch
  against master revision 85637c6.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1413//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715240#comment-14715240 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12734397/TEZ-391-WIP-7.patch
  against master revision eb70cb7.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1029//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569873#comment-14569873 ]

Hitesh Shah commented on TEZ-391:
---------------------------------

ping [~bikassaha] for review
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554104#comment-14554104 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12734397/TEZ-391-WIP-7.patch
  against master revision aa6a84c.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 11 new or modified test files.

{color:red}-1 javac{color}. The applied patch generated 162 javac compiler warnings (more than the master's current 161 warnings).

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/719//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/719//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/719//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553819#comment-14553819 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12734339/TEZ-391-WIP-7.patch
  against master revision 7c16b10.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 11 new or modified test files.

{color:red}-1 javac{color}. The applied patch generated 162 javac compiler warnings (more than the master's current 161 warnings).

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/717//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/717//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/717//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553693#comment-14553693 ]

Jeff Zhang commented on TEZ-391:
--------------------------------

The following table shows the different edge types we may need to support. Rows are the source of the edge and columns are the destination.

| Source \ Destination | Vertex | VertexGroup |
| Vertex | Common Edge | SharedOutputEdge |
| VertexGroup | GroupInputEdge | Both SharedOutputEdge & GroupInputEdge (not implemented yet) |

The main changes in this patch:
* Currently SharedOutputEdge only supports One-to-One and Broadcast. ScatterGather requires the two downstream vertices to have the same parallelism, otherwise shuffle will break. I made some changes to get ScatterGather working, but it still needs more work, especially for reducer auto-parallelism. For Pig's usage scenario, One-to-One and Broadcast should be sufficient for now.
* Workflow for a shared output edge
** The client specifies the shared output edge when building the DAG.
** The AM gets the shared output edge from the DAGPlan and passes the SharedOutputSpec through TaskSpec to TezChild.
** LogicalIOProcessorRuntimeTask gets the TaskSpec, which contains the SharedOutputSpec. It creates the corresponding SharedLogicOutput & SharedOutputContext, which are very similar to the common LogicOutput & OutputContext. The only difference is that SharedLogicOutput & SharedOutputContext are associated with the downstream vertex group name rather than the downstream vertex name. The key point is that although we generate only one copy of the DataMovementEvent, we send that one copy to each member of the downstream vertex group. (This is done in LogicalIOProcessorRuntimeTask.close().)
* Refactoring changes
** I renamed many occurrences of MergedInput to GroupedInput to align with SharedOutput.
** Renamed VertexImpl#sharedOutput to VertexImpl#mergedOutput.
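The key step in the workflow above, one event generated once and delivered to every member of the downstream vertex group, can be sketched roughly as follows. This is only an illustration under stated assumptions: `SharedOutputFanOut`, `Event` and `RoutedEvent` are hypothetical names invented here, not classes from Tez or from the WIP patches, and the real routing presumably involves the AM and TaskSpec plumbing that is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these types are Tez classes. */
final class SharedOutputFanOut {

    /** Stand-in for a data movement event describing output that was written once. */
    static final class Event {
        final String sourceVertex;
        final int sourceTaskIndex;
        final String payload;

        Event(String sourceVertex, int sourceTaskIndex, String payload) {
            this.sourceVertex = sourceVertex;
            this.sourceTaskIndex = sourceTaskIndex;
            this.payload = payload;
        }
    }

    /** Stand-in for an event addressed to one destination vertex. */
    static final class RoutedEvent {
        final String destinationVertex;
        final Event event;

        RoutedEvent(String destinationVertex, Event event) {
            this.destinationVertex = destinationVertex;
            this.event = event;
        }
    }

    /** Addresses the same event instance to every member of the destination vertex group. */
    static List<RoutedEvent> fanOut(Event event, List<String> groupMembers) {
        List<RoutedEvent> routed = new ArrayList<>();
        for (String member : groupMembers) {
            // The Event object is shared; only the addressing wrapper is per destination.
            routed.add(new RoutedEvent(member, event));
        }
        return routed;
    }

    public static void main(String[] args) {
        Event written = new Event("v1", 0, "output-location");
        for (RoutedEvent r : fanOut(written, Arrays.asList("v2", "v3"))) {
            System.out.println(r.destinationVertex + " <- " + r.event.payload);
        }
    }
}
```

The sketch keeps a single `Event` object and wraps it per destination, which mirrors the "one copy, sent to each member" wording of the comment; whether the patch shares or duplicates the event object in memory is not stated in this thread.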
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515040#comment-14515040 ]

Bikas Saha commented on TEZ-391:
--------------------------------

[~zjffdu] Can you make a call on whether this is for 0.7.0 or not? IMO, if this was close to being done then perhaps yes.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514688#comment-14514688 ]

Hitesh Shah commented on TEZ-391:
---------------------------------

[~bikassaha] [~zjffdu] Is this for 0.7?
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508396#comment-14508396 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12727512/TEZ-391-WIP-6.patch
  against master revision fe11c5e.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files.

{color:red}-1 javac{color}. The applied patch generated 161 javac compiler warnings (more than the master's current 160 warnings).

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/520//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/520//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/520//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508332#comment-14508332 ]

TezQA commented on TEZ-391:
---------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12727511/TEZ-391-WIP-5.patch
  against master revision fe11c5e.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/519//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338237#comment-14338237 ]

Hadoop QA commented on TEZ-391:
-------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12701014/TEZ-391-WIP-4.patch
  against master revision 1ccb0be.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files.

{color:red}-1 javac{color}. The applied patch generated 181 javac compiler warnings (more than the master's current 180 warnings).

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/231//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/231//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/231//console

This message is automatically generated.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306425#comment-14306425 ]

Jeff Zhang commented on TEZ-391:
--------------------------------

bq. Sounds good. But can we call it SharedOutputEdge instead of ShareOutputEdge?

Sure.

bq. We already use GroupInputEdge in pig. Refer to TezDAGBuilder. Not sure how you can set up the edge for Vertex Group without that as the mergedinput descriptor needs to be set for it.

This may need some API changes. As I understand it, the input descriptor depends on the edge property, so the following API should be sufficient for creating any kind of edge. Since this would change the API, it is just a proposal; I won't do it in this jira.

{code}
Edge.create(vertex/vertexgroup, vertex/vertexgroup, edge_property)
{code}
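To make the proposal concrete, here is a minimal sketch of how a single factory taking either a vertex or a vertex group at each end could derive the edge kind from its endpoints. It follows the rule stated elsewhere in this thread (a vertex group as destination means shared output, a vertex group as source means grouped input). Everything in it is an assumption made for illustration: `UnifiedEdgeSketch`, `Endpoint`, `EndpointKind` and `EdgeKind` are hypothetical names, the edge property is reduced to a plain string, and none of this is the Tez API or the content of the patches.

```java
/** Hypothetical sketch of the proposed unified edge factory; not the Tez API. */
final class UnifiedEdgeSketch {

    enum EndpointKind { VERTEX, VERTEX_GROUP }

    enum EdgeKind { COMMON, SHARED_OUTPUT, GROUP_INPUT, SHARED_OUTPUT_AND_GROUP_INPUT }

    /** Either a vertex or a vertex group, identified by name. */
    static final class Endpoint {
        final String name;
        final EndpointKind kind;

        Endpoint(String name, EndpointKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    /** An edge whose kind is derived from its endpoints rather than chosen by the caller. */
    static final class Edge {
        final Endpoint source;
        final Endpoint destination;
        final String edgeProperty; // placeholder for a real edge property object
        final EdgeKind kind;

        private Edge(Endpoint source, Endpoint destination, String edgeProperty) {
            this.source = source;
            this.destination = destination;
            this.edgeProperty = edgeProperty;
            this.kind = classify(source.kind, destination.kind);
        }

        static Edge create(Endpoint source, Endpoint destination, String edgeProperty) {
            return new Edge(source, destination, edgeProperty);
        }
    }

    /** Maps the 2 by 2 combinations of endpoint kinds to an edge kind. */
    static EdgeKind classify(EndpointKind source, EndpointKind destination) {
        boolean groupedSource = source == EndpointKind.VERTEX_GROUP;
        boolean groupedDestination = destination == EndpointKind.VERTEX_GROUP;
        if (groupedSource && groupedDestination) {
            return EdgeKind.SHARED_OUTPUT_AND_GROUP_INPUT;
        }
        if (groupedDestination) {
            return EdgeKind.SHARED_OUTPUT;
        }
        if (groupedSource) {
            return EdgeKind.GROUP_INPUT;
        }
        return EdgeKind.COMMON;
    }

    public static void main(String[] args) {
        Endpoint v1 = new Endpoint("v1", EndpointKind.VERTEX);
        Endpoint group = new Endpoint("v2_v3", EndpointKind.VERTEX_GROUP);
        System.out.println(Edge.create(v1, group, "BROADCAST").kind);
        System.out.println(Edge.create(group, v1, "ONE_TO_ONE").kind);
    }
}
```

The design point illustrated is that callers never name GroupInputEdge or SharedOutputEdge; the kind falls out of the endpoint types, which is what would let those classes stay out of the public API.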
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305808#comment-14305808 ]

Rohini Palaniswamy commented on TEZ-391:
----------------------------------------

bq. I still think ShareOutputEdge is more suitable. Because for GroupInputEdge, there's multiple inputs from upstream vertices, we group them together into GroupInput. While for ShareOutputEdge, there's actually only one output from upstream vertex. So from semantic perspective I think ShareOutputEdge is better.

Sounds good. But can we call it SharedOutputEdge instead of ShareOutputEdge?

bq. Besides, I am thinking is it necessary to expose the GroupInputEdge/ShareOutputEdge as public API. IMO, I don't think it is necessary.

We already use GroupInputEdge in Pig. Refer to TezDAGBuilder. I am not sure how you can set up the edge for a VertexGroup without it, as the merged input descriptor needs to be set for it.
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304797#comment-14304797 ]

Jeff Zhang commented on TEZ-391:
--------------------------------

bq. We should probably name it GroupOutputEdge to be symmetric with GroupInputEdge.

I still think ShareOutputEdge is more suitable. For GroupInputEdge there are multiple inputs from upstream vertices, and we group them together into a GroupInput. For ShareOutputEdge there is actually only one output from the upstream vertex. So from a semantic perspective I think ShareOutputEdge is better. Besides, there is already a concept of SharedOutput in VertexImpl (VertexImpl::addSharedOutputs) for output to a data sink. I think it would be much better to rename that kind of output to GroupOutput.

bq. Is the design suggesting that a group output edge expand into standard edges with additional metadata at the source vertex which will enable its TezChild to provide a single output to its tasks even though there are multiple consumers?

Yes. TezChild has only one output but sends multiple events to the AM, based on the additional metadata about the shared edge.

bq. What happens to fault tolerance? If a destination vertex reports an error about a shared source then what should happen in other destination vertices that are sharing that source?

The upstream vertex will get the InputReadErrorEvent and send the InputFailedEvent to both downstream vertices. In theory this should not be a problem. But you are right, I need to highlight these cases and verify them in unit tests.

bq. Related: When an output of a task is marked bad then it sends an InputFailed event to its destination tasks. This happens in the AM and needs to be sent to all destination tasks of a shared output. So the AM routing would need to take into account shared outputs for this case.

The AM knows the standard edges that are expanded from the shared edge, so all the downstream vertices will get the InputFailed event.

bq. Can it happen that a VertexGroup is connected to another VertexGroup? What use case would that be?

Good question. This would be the case where two unions are joined together and one of them is the replicated side. The edge between these vertex groups would then be both a GroupInputEdge and a ShareOutputEdge. I need to look into it more deeply.

{code}
a = load 'file:///tmp/input' as (x:int, y:chararray);
b = load 'file:///tmp/input' as (y:chararray, x:int);
c = union onschema a, b;
d = load 'file:///tmp/input1' as (x:int, z:chararray);
e = load 'file:///tmp/input2' as (x:int, z:chararray);
f = union onschema d, e;
g = join c by x, f by x using 'replicated';
store g into 'file:///tmp/output';
{code}

Besides, I am wondering whether it is necessary to expose GroupInputEdge/ShareOutputEdge as public API. The user just needs to create an edge connecting one Vertex/VertexGroup to another Vertex/VertexGroup (2 by 2 cases):
* If the destination is a vertex group, its members share one copy of the output from the source, whether the source is a vertex or a vertex group.
* If the source is a vertex group, the destination uses the merged input from the source, whether the destination is a vertex or a vertex group.
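The fault-tolerance answer above relies on the AM knowing which standard edges a shared edge was expanded into, so that a failed source output can be reported to every consumer. A rough sketch of that bookkeeping, under stated assumptions, is shown below. `SharedEdgeFailureRouting` and its methods are hypothetical names invented here; the actual AM routing and event classes in Tez are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of failure routing for a shared output; not Tez code. */
final class SharedEdgeFailureRouting {

    /** Source vertex name mapped to the destinations of the standard edges it was expanded into. */
    private final Map<String, List<String>> expandedEdges = new LinkedHashMap<>();

    /** Records that a shared edge from source was expanded into one standard edge per group member. */
    void addSharedEdge(String source, List<String> groupMembers) {
        expandedEdges.computeIfAbsent(source, k -> new ArrayList<>()).addAll(groupMembers);
    }

    /** Returns every destination vertex that must be told the output of source has failed. */
    List<String> inputFailedTargets(String source) {
        List<String> targets = expandedEdges.getOrDefault(source, Collections.<String>emptyList());
        return Collections.unmodifiableList(targets);
    }

    public static void main(String[] args) {
        SharedEdgeFailureRouting routing = new SharedEdgeFailureRouting();
        routing.addSharedEdge("v1", Arrays.asList("v2", "v3"));
        // A read error reported by any one consumer of v1 leads to notifying all consumers.
        for (String target : routing.inputFailedTargets("v1")) {
            System.out.println("InputFailed -> " + target);
        }
    }
}
```

The point of the sketch is only that the notification set is keyed by the source, not by the reporting consumer: an error reported by one destination still results in all destinations sharing that output being informed.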
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304395#comment-14304395 ]

Bikas Saha commented on TEZ-391:

Thanks for the doc. It gives a good overall picture. We should probably name it GroupOutputEdge to be symmetric with GroupInputEdge.

Some parts are not clear. E.g. a group input edge actually expands into a set of standard edges along with some metadata on the destination vertex that enables the AM to send additional info to the destination TezChild. TezChild uses the metadata to create a unified input that wraps around the member inputs. This provides a merged view on top of the real inputs, and failure handling remains as is. Is the design suggesting that a group output edge expands into standard edges with additional metadata at the source vertex, which will enable its TezChild to provide a single output to its tasks even though there are multiple consumers?

Replicating the event at TezChild vs. keeping it single needs some more thought. E.g. replication would multiply event memory by the number of replicas.

What happens to fault tolerance? If a destination vertex reports an error about a shared source, then what should happen in the other destination vertices that are sharing that source? Related: when an output of a task is marked bad, an InputFailed event is sent to its destination tasks. This happens in the AM, and the event needs to be sent to all destination tasks of a shared output. So the AM routing would need to take shared outputs into account for this case. The OutputReportedFailedTransition may need to be updated to consider the case where errors are reported from multiple vertices with different task counts.

Shared output to a data sink was already covered in the jira that added GroupInputEdge. So we can skip that here.

Can it happen that a VertexGroup is connected to another VertexGroup? What use case would that be? Until now standard vertices would be inputs to a VertexGroup. Shared edge will allow VertexGroups to be outputs to standard vertices.

> SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch, > TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
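The routing concern raised in the comment above (an InputFailed event has to reach every destination task of a shared output, across vertices with different task counts) can be illustrated with a toy model. This is only a sketch under stated assumptions: `SharedOutputFailureRouting`, `Consumer`, `Movement` and `tasksToNotify` are hypothetical names invented for illustration and are not Tez classes or the AM's actual routing code. It assumes that with BROADCAST every destination task reads every source task, and with ONE-TO-ONE destination task i reads source task i.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these types exist in Tez. */
final class SharedOutputFailureRouting {

    /** Simplified data-movement kinds, named after the edge types mentioned in this issue. */
    enum Movement { BROADCAST, ONE_TO_ONE }

    /** Hypothetical description of one destination vertex that consumes the shared output. */
    static final class Consumer {
        final String vertexName;
        final int numTasks;
        final Movement movement;

        Consumer(String vertexName, int numTasks, Movement movement) {
            this.vertexName = vertexName;
            this.numTasks = numTasks;
            this.movement = movement;
        }
    }

    /**
     * Returns "vertex:taskIndex" for every destination task that should be told
     * that the output of the given source task is no longer usable.
     */
    static List<String> tasksToNotify(int failedSourceTask, List<Consumer> consumers) {
        List<String> targets = new ArrayList<>();
        for (Consumer c : consumers) {
            if (c.movement == Movement.BROADCAST) {
                // Every task of this consumer read the failed output.
                for (int t = 0; t < c.numTasks; t++) {
                    targets.add(c.vertexName + ":" + t);
                }
            } else if (failedSourceTask < c.numTasks) {
                // One-to-one: only the task with the same index read it.
                targets.add(c.vertexName + ":" + failedSourceTask);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Consumer> consumers = Arrays.asList(
                new Consumer("v2", 2, Movement.BROADCAST),
                new Consumer("v3", 3, Movement.ONE_TO_ONE));
        // Source task 1 failed: both v2 tasks and v3 task 1 must be notified.
        System.out.println(tasksToNotify(1, consumers)); // prints [v2:0, v2:1, v3:1]
    }
}
```

The point of the sketch is that the fan-out is computed per consumer, so a single failed source task can map to a different number of destination tasks in each vertex that shares the output.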
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300856#comment-14300856 ] Hadoop QA commented on TEZ-391: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12695866/Shared%20Edge%20Design.pdf against master revision cfa637a. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/110//console This message is automatically generated. > SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch, > TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291652#comment-14291652 ] Hadoop QA commented on TEZ-391: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12694458/TEZ-391-WIP-3.patch against master revision 12e1e66. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/77//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/77//console This message is automatically generated. > SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, > TEZ-391-WIP-3.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291616#comment-14291616 ] Jeff Zhang commented on TEZ-391: [~bikassaha] Thanks for your comments. I will attach a design doc later for your review. > SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, > TEZ-391-WIP-3.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291516#comment-14291516 ]

Bikas Saha commented on TEZ-391:

I am glad that you have come to the conclusion that VertexGroup can be symmetrically used to create shared outputs in the same manner as it is currently used to create shared inputs. I had thought about the shared edge implementation after this jira was opened, and this seemed like the most natural solution. I should have noted that down in a design note earlier, but it looks like we are on the same page.

Before going down the implementation path, it would be great if you could leave a design note that outlines the flow, from the API spec to how the logic flows through to the tasks. This will help clear up the design and enable others to understand it better.

> SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, > TEZ-391-WIP-3.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287210#comment-14287210 ] Hadoop QA commented on TEZ-391: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12693848/TEZ-391-WIP-2.patch against master revision 3f4e8a7. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/71//console This message is automatically generated. > SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282411#comment-14282411 ] Hadoop QA commented on TEZ-391: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12693059/TEZ-391-WIP-1.patch against master revision c684653. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 64 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/52//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/52//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/52//artifact/patchprocess/newPatchFindbugsWarningstez-mapreduce.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/52//artifact/patchprocess/newPatchFindbugsWarningstez-examples.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/52//console This message is automatically generated. > SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: TEZ-391-WIP-1.patch > > > We need this for lot of usecases. 
For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-391) SharedEdge - Support for passing same output from a vertex as input to two different vertices
[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282355#comment-14282355 ]

Jeff Zhang commented on TEZ-391:

Attaching a patch for SharedEdge.

* Adds a new API in Edge to create a shared edge
{code}
public Edge createSharedEdge(Vertex outputVertex)
{code}
* Currently it only supports One-to-One and Broadcast. (ScatterGather requires the 2 downstream vertices to have the same parallelism, otherwise shuffle will break. I made some changes to get ScatterGather working, but it still needs more work, especially on reducer auto-parallelism.)
* Adds one example in tez-examples to show the usage (SharedEdgeExample).

Although this patch works, after more thought I think using VertexGroup may be more natural and easier to understand. (We just need to make the 2 downstream vertices a vertex group and connect the upstream vertex to this vertex group.) VertexGroup is now used for shared output, so it is also natural to make it support shared input. I will attach a new patch that uses VertexGroup later.

> SharedEdge - Support for passing same output from a vertex as input to two > different vertices > - > > Key: TEZ-391 > URL: https://issues.apache.org/jira/browse/TEZ-391 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jeff Zhang > Attachments: TEZ-391-WIP-1.patch > > > We need this for lot of usecases. For cases where multi-query is turned off > and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and > we write the output multiple times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
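The ScatterGather restriction mentioned in the comment above can be illustrated with a small sketch. It assumes plain hash partitioning, where a record goes to partition `hash(key) mod numPartitions` and the partition count equals the consumer's parallelism; the partitioner actually used by a Tez job is pluggable, and `SharedScatterGatherCheck`, `partitionFor` and `canShareOutput` are hypothetical names used only for illustration. Under that assumption, an output written once has a single partition count, so it can only serve consumers that all agree on it.

```java
/** Toy model only; these helpers are not part of Tez. */
final class SharedScatterGatherCheck {

    /** Assumed partitioning rule: non-negative hash of the key modulo the partition count. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * An output that is written once can be shared over a scatter-gather style
     * edge only if every consumer expects the same number of partitions.
     */
    static boolean canShareOutput(int... consumerParallelism) {
        for (int p : consumerParallelism) {
            if (p != consumerParallelism[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The same key lands in different partitions for 2 and for 3 consumer tasks,
        // so one physical output cannot satisfy both layouts.
        System.out.println(partitionFor("b", 2)); // prints 0
        System.out.println(partitionFor("b", 3)); // prints 2
        System.out.println(canShareOutput(3, 3)); // prints true
        System.out.println(canShareOutput(2, 3)); // prints false
    }
}
```

One-to-One and Broadcast do not partition by key in this way, which is consistent with the comment's statement that those two types were the ones supported first.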