[jira] [Commented] (TEZ-3430) Make split sorting optional
[ https://issues.apache.org/jira/browse/TEZ-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578954#comment-15578954 ]

TezQA commented on TEZ-3430:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12828869/TEZ-3430.patch
against master revision 43f7b5e.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2040//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2040//console

This message is automatically generated.

> Make split sorting optional
> ---
>
> Key: TEZ-3430
> URL: https://issues.apache.org/jira/browse/TEZ-3430
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: TEZ-3430.patch
>
>
> The fair routing design in TEZ-3209 addresses skewed partitions, where one
> partition could be much larger than the others. But to simplify the stats
> tracking, it assumes that a given partition's data is distributed evenly to
> some degree across source tasks, so that it can group consecutive source
> tasks together.
> However, this assumption is invalid given that {{MRInputHelpers}}'s
> generateNewSplits and generateOldSplits sort the splits by size; thus the
> data size at the beginning of the source task range is bigger than that at
> the end.
> {noformat}
> Arrays.sort(splits, new InputSplitComparator());
> {noformat}
> One way to fix this is to have fair routing track not only the aggregated
> size of each partition, but also the size of each partition of each source
> task. But that will significantly increase the memory footprint.
> Alternatively, it can skip the sorting above. Test results for TEZ-3209 show
> that jobs can finish 30% faster, given that the source tasks' output size is
> more balanced.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
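To make the TEZ-3430 description concrete, here is a minimal sketch of what "optional" sorting could look like, and of why largest-first ordering skews groups of consecutive source tasks. It is not the code in TEZ-3430.patch: SplitSketch, OptionalSplitSorting, prepare, groupTotals and the sortSplits flag are hypothetical names, and using split length as a proxy for a source task's output size is a simplifying assumption.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for an input split; only the length matters here.
final class SplitSketch {
  final String name;
  final long length;

  SplitSketch(String name, long length) {
    this.name = name;
    this.length = length;
  }
}

final class OptionalSplitSorting {
  // Largest-first ordering, the effect the issue attributes to sorting by size.
  static final Comparator<SplitSketch> BY_SIZE_DESC =
      (a, b) -> Long.compare(b.length, a.length);

  // Returns a copy of the splits, sorted by size only when sortSplits is true.
  static SplitSketch[] prepare(SplitSketch[] splits, boolean sortSplits) {
    SplitSketch[] copy = Arrays.copyOf(splits, splits.length);
    if (sortSplits) {
      Arrays.sort(copy, BY_SIZE_DESC);
    }
    return copy;
  }

  // Sums sizes over groups of groupSize consecutive source tasks (groupSize > 0).
  static long[] groupTotals(long[] sizes, int groupSize) {
    int groups = (sizes.length + groupSize - 1) / groupSize;
    long[] totals = new long[groups];
    for (int i = 0; i < sizes.length; i++) {
      totals[i / groupSize] += sizes[i];
    }
    return totals;
  }
}
```

With sizes 40, 30, 20, 10 in sorted order, groups of two consecutive tasks total 70 and 30; the same sizes in the order 40, 10, 30, 20 total 50 and 50, which is the kind of balance the description is after.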
Success: TEZ-3430 PreCommit Build #2040
Jira: https://issues.apache.org/jira/browse/TEZ-3430
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2040/

###
## LAST 60 LINES OF THE CONSOLE ###
[...truncated 4824 lines...]
[INFO] Tez SUCCESS [ 0.025 s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 55:41 min
[INFO] Finished at: 2016-10-15T23:48:47+00:00
[INFO] Final Memory: 86M/1336M
[INFO]

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12828869/TEZ-3430.patch
against master revision 43f7b5e.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2040//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2040//console

This message is automatically generated.

== ==
Adding comment to Jira.
== ==

Comment added.
73bc7b8759216aa7ac9685a687d9d24f703b9018 logged out

== ==
Finished build.
== ==

Archiving artifacts
[description-setter] Description set: TEZ-3430
Recording test results
Email was triggered for: Success
Sending email for trigger: Success

###
## FAILED TESTS (if any) ##
All tests passed
[jira] [Updated] (TEZ-3269) Provide basic fair routing and scheduling functionality via custom VertexManager and EdgeManager
[ https://issues.apache.org/jira/browse/TEZ-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-3269:

Assignee: Ming Ma

> Provide basic fair routing and scheduling functionality via custom
> VertexManager and EdgeManager
>
>
> Key: TEZ-3269
> URL: https://issues.apache.org/jira/browse/TEZ-3269
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: TEZ-3269-2.patch, TEZ-3269-3.patch, TEZ-3269.patch
>
>
> With TEZ-3206 and TEZ-3216, we can build a custom VertexManager and
> EdgeManager that uses partition stats to do fair routing as well as the
> scheduling based on destination tasks’ dependency on source tasks.
[jira] [Commented] (TEZ-3430) Make split sorting optional
[ https://issues.apache.org/jira/browse/TEZ-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578851#comment-15578851 ]

Siddharth Seth commented on TEZ-3430:

+1. Thanks [~mingma]

> Make split sorting optional
> ---
>
> Key: TEZ-3430
> URL: https://issues.apache.org/jira/browse/TEZ-3430
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: TEZ-3430.patch
[jira] [Commented] (TEZ-3215) Support for MultipleOutputs
[ https://issues.apache.org/jira/browse/TEZ-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578833#comment-15578833 ]

Siddharth Seth commented on TEZ-3215:

Couple of minor comments.
- Missing @Override annotation on flush in MROutputs
- newRecordWriter / oldRecordWriter will be set up when MROutput.initialize is called. Think this is avoidable.
- Could be called MultiMROutput - similar to MultiMRInput (which deals with multiple readers). Up to you if you want to change this.

Any changes required to the associated OutputCommitter?

Otherwise, looks good to me.

> Support for MultipleOutputs
> ---
>
> Key: TEZ-3215
> URL: https://issues.apache.org/jira/browse/TEZ-3215
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: TEZ-3215-2.patch, TEZ-3215-3.patch, TEZ-3215-4.patch,
> TEZ-3215-5.patch, TEZ-3215.patch
>
>
> Here is the use case. A reducer might write its output to more than one file.
> The file name will be based on the mapper key. We don't know all possible
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work, as we don't know
> all possible keys.
> * Use MR MultipleOutputs directly from the Tez application processor. It
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat. But
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with a HashMap {{recordWriters}} to achieve
> this. If this is a solved problem, can anyone explain how to do it?
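As an illustration of the approach described in the issue (a map of record writers keyed by output file name), here is a minimal sketch. It is not the patch and does not use the Tez or Hadoop APIs: SketchWriter, InMemoryWriter and MultiWriterSketch are hypothetical stand-ins, and only the {{recordWriters}} map name comes from the description. Writers are created on first use, which is one possible way to avoid setting them up eagerly at initialization.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer interface; a stand-in, not the Hadoop RecordWriter API.
interface SketchWriter {
  void write(String key, String value);

  void close();
}

// Keeps records in memory so the sketch runs without a file system.
final class InMemoryWriter implements SketchWriter {
  final List<String> lines = new ArrayList<>();
  boolean closed = false;

  @Override
  public void write(String key, String value) {
    lines.add(key + "\t" + value);
  }

  @Override
  public void close() {
    closed = true;
  }
}

final class MultiWriterSketch {
  // One writer per output name; names are not known ahead of time.
  private final Map<String, InMemoryWriter> recordWriters = new HashMap<>();

  // Routes a record to the writer for fileName, creating the writer lazily.
  void write(String fileName, String key, String value) {
    recordWriters.computeIfAbsent(fileName, n -> new InMemoryWriter()).write(key, value);
  }

  InMemoryWriter writerFor(String fileName) {
    return recordWriters.get(fileName);
  }

  int openWriters() {
    return recordWriters.size();
  }

  // Closes every writer that was opened during the task.
  void close() {
    for (InMemoryWriter writer : recordWriters.values()) {
      writer.close();
    }
  }
}
```

How such writers would interact with the OutputCommitter is the open question raised in the comment above and is not addressed by this sketch.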
[jira] [Commented] (TEZ-3269) Provide basic fair routing and scheduling functionality via custom VertexManager and EdgeManager
[ https://issues.apache.org/jira/browse/TEZ-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578791#comment-15578791 ]

Siddharth Seth commented on TEZ-3269:

Apologies for the long delay in the review.

Mostly looks good. Would be a lot easier to review if this were split into smaller jiras... think it combines a bunch of things like long to int, with the core logic changes.

Minor Stuff:
- final where possible - e.g. PartitionsGroupingCalculator.sourceVertexInfo, all variables in FairEdgeConfiguration
- This is a fairly complicated patch. Would be good to have some more documentation.
  - the ceil method
  - Within various conditions in compute and iterator
- Obligatory rename request: getTotalStatsAtIndex to getCurrentlyKnownStatsAtIndex - this method will normally not return totalStats.
- Nit: expectedTotalSourceTasksOutputSize / numOfPartitions; - can be done once outside the loop
- onVertexStarted - Should this be split up a little more? It's possible for quite a bit to happen at the moment, before the "single vertex only" check is hit in FairShuffleVertexManager

Question:
- estimatePartitionSize.partitionstatSizeInMB is across all partitions. This ensures that averaging of stats based on output size isn't accidentally hit on a 0 sized partition? (Could break earlier from the loop)
- In case of reduce_parallelism - this considers the partition size and may produce groups with different number of partitions to consume, which the current ShuffleVertexManager doesn't do yet?
- Will the parallelism ever end up getting increased?

Any thoughts on what it will take to move this to support multiple source vertices?

> Provide basic fair routing and scheduling functionality via custom
> VertexManager and EdgeManager
>
>
> Key: TEZ-3269
> URL: https://issues.apache.org/jira/browse/TEZ-3269
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Ming Ma
> Attachments: TEZ-3269-2.patch, TEZ-3269-3.patch, TEZ-3269.patch
>
>
> With TEZ-3206 and TEZ-3216, we can build a custom VertexManager and
> EdgeManager that uses partition stats to do fair routing as well as the
> scheduling based on destination tasks’ dependency on source tasks.
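Two of the review nits (documenting a ceil helper, and computing expectedTotalSourceTasksOutputSize / numOfPartitions once outside the loop) can be sketched as follows. This is only an illustration under assumptions: the patch's actual ceil method and loop body are not shown in this thread, so ReviewNitsSketch, countOversized and the loop's purpose are hypothetical; only the two variable names in the division come from the review.

```java
final class ReviewNitsSketch {
  /**
   * Ceiling of a / b using integer arithmetic only.
   * Assumes a >= 0 and b > 0, and that a + b - 1 does not overflow.
   */
  static long ceil(long a, long b) {
    return (a + b - 1) / b;
  }

  // Counts partitions larger than the average expected partition size.
  static int countOversized(long[] partitionSizes, long expectedTotalSourceTasksOutputSize) {
    int numOfPartitions = partitionSizes.length;
    if (numOfPartitions == 0) {
      return 0;
    }
    // Loop-invariant value, computed once instead of on every iteration.
    long averagePartitionSize = expectedTotalSourceTasksOutputSize / numOfPartitions;
    int oversized = 0;
    for (long size : partitionSizes) {
      if (size > averagePartitionSize) {
        oversized++;
      }
    }
    return oversized;
  }
}
```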