[jira] Updated: (PIG-1198) [zebra] performance improvements

2010-02-25 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1198:
---


Patch reviewed.

Some feedback:

1) In the fillRowSplit() method, reader.close() should always be called at the 
end;

2) In mapreduce.TableInputFormat.getRowSplits(), the batchSize variable is not 
needed.
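
As an illustration of point 1), here is a minimal sketch of closing a reader on every exit path with try/finally. RowReader and countRows are hypothetical names made up for this example; they are not Zebra classes, and the actual body of fillRowSplit() is not reproduced here.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative only: RowReader and countRows are hypothetical, not Zebra APIs. */
final class CloseAlways {
    /** Stand-in for whatever reader a method such as fillRowSplit() opens. */
    interface RowReader extends Closeable {
        boolean next() throws IOException;
    }

    /** Counts rows and closes the reader whether or not iteration throws. */
    static long countRows(RowReader reader) throws IOException {
        try {
            long rows = 0;
            while (reader.next()) {
                rows++;
            }
            return rows;
        } finally {
            // Runs on normal return and when next() throws.
            reader.close();
        }
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code.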


Patch looks good overall. +1


 [zebra] performance improvements
 

 Key: PIG-1198
 URL: https://issues.apache.org/jira/browse/PIG-1198
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1198.patch, PIG-1198.patch


 1) Input splits are currently generated by row-based splitting of individual 
 TFiles. As a result, every TFile gets at least one split, even when it is 
 smaller than one block. Many mappers, and many waves of them, are then needed 
 to handle the many small TFiles produced by the equally many mappers/reducers 
 that wrote the data. This can be addressed by generating input splits that 
 span multiple TFiles. 
 2) For sorted tables, the table's key distribution, which is used to generate 
 proper input splits, includes key distributions from column groups even if 
 they are not in the projection. This incurs the cost of unnecessary 
 computation and, worse, produces unreasonable input splits. 
 3) For unsorted tables, when row splits are generated on a union of tables, 
 FileSplits are generated for each table and then lumped together to form the 
 final list of splits passed to Map/Reduce. Consequently, the number of splits 
 depends on the number of tables in the union rather than only on the number 
 of splits used by the Map/Reduce framework. 
 4) The input split's goal size is calculated over all column groups, even 
 those that are not in the projection. 
 5) For input splits that cover multiple files in one column group, all files 
 are opened at startup. This is unnecessary and holds resources from start to 
 end. Files should be opened when needed and closed when no longer needed. 
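
The idea of letting one input split include several small TFiles can be sketched as a greedy packing of file lengths against a goal size. This is only an illustration under stated assumptions: SplitPacker, pack and goalSize are hypothetical names, files are represented by their lengths alone, and the splitting of a single oversized file by rows is not modeled. It is not the code in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of Zebra. */
final class SplitPacker {
    /** Groups consecutive file lengths into splits of at most goalSize bytes where possible. */
    static List<List<Long>> pack(List<Long> fileLengths, long goalSize) {
        List<List<Long>> splits = new ArrayList<List<Long>>();
        List<Long> current = new ArrayList<Long>();
        long currentBytes = 0;
        for (long length : fileLengths) {
            // Start a new split when adding this file would exceed the goal.
            if (!current.isEmpty() && currentBytes + length > goalSize) {
                splits.add(current);
                current = new ArrayList<Long>();
                currentBytes = 0;
            }
            // A file larger than goalSize still gets a split of its own.
            current.add(length);
            currentBytes += length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With a goal size of 64 and file lengths 10, 20, 30, 100, 5, 5, this sketch yields three splits instead of six: the first three files together, the large file alone, and the last two files together.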

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1198) [zebra] performance improvements

2010-02-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1198:
--

Attachment: PIG-1198.patch

To address the review comments.

 Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch





[jira] Updated: (PIG-1198) [zebra] performance improvements

2010-02-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1198:
--

Attachment: PIG-1198.patch

This patch is based on the load-store-redesign branch, so it may differ 
slightly from the final patch to be applied to the trunk because of the 
different code base. It is therefore for review only; no submission is 
intended. 

 Attachments: PIG-1198.patch

