[jira] [Commented] (TEZ-3430) Make split sorting optional

2016-10-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578954#comment-15578954
 ] 

TezQA commented on TEZ-3430:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12828869/TEZ-3430.patch
  against master revision 43f7b5e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2040//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2040//console

This message is automatically generated.

> Make split sorting optional
> ---
>
> Key: TEZ-3430
> URL: https://issues.apache.org/jira/browse/TEZ-3430
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: TEZ-3430.patch
>
>
> The fair routing design in TEZ-3209 addresses the skewed partitions where one 
> partition could be much larger than the others. But to simplify the stats 
> tracking, it assumes a given partition's data is distributed evenly to some 
> degree across source tasks so that it can group consecutive source tasks 
> together.
> However, this assumption does not hold, because {{MRInputHelpers}}'s 
> generateNewSplits and generateOldSplits sort the splits by size, so the 
> data size at the beginning of the source task range is larger than at the 
> end.
> {noformat}
> Arrays.sort(splits, new InputSplitComparator());
> {noformat}
> One way to fix this is to have fair routing track not only the aggregated 
> size of each partition, but also the size of each partition of each source 
> task. But that will significantly increase the memory footprint.
> Alternatively, it can skip the sorting above. Test results for TEZ-3209 show 
> that jobs can finish 30% faster when the source tasks' output sizes are more 
> balanced.
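
To illustrate the description above, here is a minimal sketch of what optional split sorting could look like. It is not the TEZ-3430 patch: the `Split` class, the `orderSplits` method, and the `sortSplits` flag are hypothetical stand-ins, and the largest-first order is inferred from the statement that the beginning of the source task range holds more data than the end.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch only: all names here are hypothetical, not Tez APIs. */
final class SplitOrderingSketch {

  /** Minimal stand-in for an input split: a name and a size in bytes. */
  static final class Split {
    final String name;
    final long length;

    Split(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  /**
   * Returns a copy of the splits. The copy is sorted largest-first only when
   * sortSplits is true; otherwise the original order is preserved.
   */
  static Split[] orderSplits(Split[] splits, boolean sortSplits) {
    Split[] result = Arrays.copyOf(splits, splits.length);
    if (sortSplits) {
      Arrays.sort(result, Comparator.comparingLong((Split s) -> s.length).reversed());
    }
    return result;
  }

  /** Sums the split lengths in the index range [from, to). */
  static long rangeSize(Split[] splits, int from, int to) {
    long total = 0;
    for (int i = from; i < to; i++) {
      total += splits[i].length;
    }
    return total;
  }

  public static void main(String[] args) {
    Split[] splits = {
      new Split("a", 10), new Split("b", 40), new Split("c", 20), new Split("d", 30)
    };
    for (boolean sort : new boolean[] {true, false}) {
      Split[] ordered = orderSplits(splits, sort);
      System.out.println("sort=" + sort
          + " firstHalf=" + rangeSize(ordered, 0, 2)
          + " secondHalf=" + rangeSize(ordered, 2, 4));
    }
  }
}
```

With sorting enabled, the first group of consecutive source tasks carries more data than the second; with sorting disabled, the two groups in this example are balanced, which is the property the fair routing grouping relies on.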



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Success: TEZ-3430 PreCommit Build #2040

2016-10-15 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3430
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2040/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 4824 lines...]
[INFO] Tez  SUCCESS [  0.025 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 55:41 min
[INFO] Finished at: 2016-10-15T23:48:47+00:00
[INFO] Final Memory: 86M/1336M
[INFO] 





==
==
Adding comment to Jira.
==
==


Comment added.
73bc7b8759216aa7ac9685a687d9d24f703b9018 logged out


==
==
Finished build.
==
==


Archiving artifacts
[description-setter] Description set: TEZ-3430
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Updated] (TEZ-3269) Provide basic fair routing and scheduling functionality via custom VertexManager and EdgeManager

2016-10-15 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-3269:

Assignee: Ming Ma

> Provide basic fair routing and scheduling functionality via custom 
> VertexManager and EdgeManager
> 
>
> Key: TEZ-3269
> URL: https://issues.apache.org/jira/browse/TEZ-3269
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: TEZ-3269-2.patch, TEZ-3269-3.patch, TEZ-3269.patch
>
>
> With TEZ-3206 and TEZ-3216, we can build a custom VertexManager and 
> EdgeManager that uses partition stats to do fair routing as well as the 
> scheduling based on destination tasks’ dependency on source tasks.




[jira] [Commented] (TEZ-3430) Make split sorting optional

2016-10-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578851#comment-15578851
 ] 

Siddharth Seth commented on TEZ-3430:
-

+1. Thanks [~mingma]



[jira] [Commented] (TEZ-3215) Support for MultipleOutputs

2016-10-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578833#comment-15578833
 ] 

Siddharth Seth commented on TEZ-3215:
-

Couple of minor comments.
- Missing @Override annotation on flush in MROutputs
- newRecordWriter / oldRecordWriter will be setup when MROutput.initialize is 
called. Think this is avoidable.
- Could be called MultiMROutput - similar to MultiMRInput (which deals with 
multiple readers). Up to you if you want to change this.
Any changes required to the associated OutputCommitter?
Otherwise, looks good to me.

> Support for MultipleOutputs
> ---
>
> Key: TEZ-3215
> URL: https://issues.apache.org/jira/browse/TEZ-3215
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: TEZ-3215-2.patch, TEZ-3215-3.patch, TEZ-3215-4.patch, 
> TEZ-3215-5.patch, TEZ-3215.patch
>
>
> Here is the use case. A reducer might write its output to more than one file. 
> The file name will be based on the mapper key. We don't know all possible 
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't 
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work as we don't know 
> all possible keys.
> * Use MR MultipleOutputs directly from the Tez application processor. It 
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat. But 
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with HashMap {{recordWriters}} to achieve 
> this. If this is a solved problem, can anyone explain how to do it?
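
The approach described above, a map of record writers created on demand because the keys are not known ahead of time, can be sketched roughly as follows. This is an assumption-laden illustration rather than the attached patch: `MultiWriterSketch` and `InMemoryWriter` are hypothetical names, and the writer collects records in memory instead of using a Hadoop `RecordWriter`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: hypothetical names, no Tez or Hadoop classes involved. */
final class MultiWriterSketch {

  /** Hypothetical writer that buffers records in memory for one output name. */
  static final class InMemoryWriter {
    final String fileName;
    final List<String> records = new ArrayList<>();
    boolean closed;

    InMemoryWriter(String fileName) {
      this.fileName = fileName;
    }

    void write(String record) {
      if (closed) {
        throw new IllegalStateException("writer already closed: " + fileName);
      }
      records.add(record);
    }
  }

  // One writer per output name, created lazily on the first write for that name.
  private final Map<String, InMemoryWriter> recordWriters = new HashMap<>();

  void write(String fileName, String record) {
    recordWriters.computeIfAbsent(fileName, InMemoryWriter::new).write(record);
  }

  /** Closes every writer that was opened so far. */
  void close() {
    for (InMemoryWriter writer : recordWriters.values()) {
      writer.closed = true;
    }
  }

  int writerCount() {
    return recordWriters.size();
  }

  /** Returns the records written under fileName, or an empty list if none. */
  List<String> recordsFor(String fileName) {
    InMemoryWriter writer = recordWriters.get(fileName);
    return writer == null ? Collections.emptyList() : writer.records;
  }

  public static void main(String[] args) {
    MultiWriterSketch out = new MultiWriterSketch();
    out.write("key-a", "1");
    out.write("key-b", "2");
    out.write("key-a", "3");
    out.close();
    System.out.println(out.writerCount() + " writers, key-a=" + out.recordsFor("key-a"));
  }
}
```

Creating writers lazily avoids declaring one sink per key up front, at the cost of holding every opened writer until `close` is called.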




[jira] [Commented] (TEZ-3269) Provide basic fair routing and scheduling functionality via custom VertexManager and EdgeManager

2016-10-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578791#comment-15578791
 ] 

Siddharth Seth commented on TEZ-3269:
-

Apologies for the long delay in the review.
Mostly looks good. Would be a lot easier to review if this were split into 
smaller jiras... think it combines a bunch of things like long to int, with the 
core logic changes.
Minor Stuff:
- final where possible - e.g. PartitionsGroupingCalculator.sourceVertexInfo, 
all variables in FairEdgeConfiguration
- This is a fairly complicated patch. Would be good to have some more 
documentation.
   - the ceil method
   - Within various conditions in compute and iterator
- Obligatory rename request: getTotalStatsAtIndex to 
getCurrentlyKnownStatsAtIndex - this method will normally not return totalStats.
- Nit: expectedTotalSourceTasksOutputSize / numOfPartitions; - can be done once 
outside the loop
- onVertexStarted - Should this be split up a little more? It's possible for 
quite a bit to happen at the moment, before the "single vertex only" check is 
hit in FairShuffleVertexManager

Question:
- estimatePartitionSize.partitionstatSizeInMB is across all partitions. This 
ensures that averaging of stats based on output size isn't accidentally hit on 
a 0 sized partition? (Could break earlier from the loop)
- In case of reduce_parallelism - this considers the partition size and may 
produce groups with different number of partitions to consume, which the 
current ShuffleVertexManager doesn't do yet?
- Will the parallelism ever end up getting increased?

Any thoughts on what it will take to move this to support multiple source 
vertices ?
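
Two of the points above (computing the per-group target once outside the loop, and grouping by partition size so that groups may contain different numbers of partitions) can be sketched as follows. This is not the logic in the TEZ-3269 patch: `PartitionGroupingSketch`, `ceilDiv`, and `groupPartitions` are hypothetical names, and the greedy packing is a simple stand-in chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: a greedy stand-in, not the algorithm from the patch. */
final class PartitionGroupingSketch {

  /** Integer ceiling division for non-negative a and positive b. */
  static long ceilDiv(long a, long b) {
    return (a + b - 1) / b;
  }

  /**
   * Packs consecutive partitions into groups whose total size stays at or
   * below a target derived from desiredGroups (which must be positive).
   * A partition larger than the target forms a group by itself.
   * Returns the number of partitions in each group, in order.
   */
  static List<Integer> groupPartitions(long[] partitionSizes, int desiredGroups) {
    long total = 0;
    for (long size : partitionSizes) {
      total += size;
    }
    // Loop-invariant target, computed once rather than on every iteration.
    long target = Math.max(1, ceilDiv(total, desiredGroups));

    List<Integer> groupCounts = new ArrayList<>();
    long currentSize = 0;
    int currentCount = 0;
    for (long size : partitionSizes) {
      // Close the current group if adding this partition would exceed the target.
      if (currentCount > 0 && currentSize + size > target) {
        groupCounts.add(currentCount);
        currentSize = 0;
        currentCount = 0;
      }
      currentSize += size;
      currentCount++;
    }
    if (currentCount > 0) {
      groupCounts.add(currentCount);
    }
    return groupCounts;
  }

  public static void main(String[] args) {
    long[] sizes = {100, 10, 10, 10, 10, 60};
    System.out.println(groupPartitions(sizes, 2));
  }
}
```

For the sizes in `main`, the large first partition ends up alone while the five smaller ones share a group, which shows how size-aware grouping yields groups with different partition counts.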

> Provide basic fair routing and scheduling functionality via custom 
> VertexManager and EdgeManager
> 
>
> Key: TEZ-3269
> URL: https://issues.apache.org/jira/browse/TEZ-3269
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Ming Ma
> Attachments: TEZ-3269-2.patch, TEZ-3269-3.patch, TEZ-3269.patch
>
>
> With TEZ-3206 and TEZ-3216, we can build a custom VertexManager and 
> EdgeManager that uses partition stats to do fair routing as well as the 
> scheduling based on destination tasks’ dependency on source tasks.


