[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703792#action_12703792
 ] 

Alan Gates commented on PIG-627:


Checked in multiquery-phase3_0423.patch to multiquery branch.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1, followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed, and 
 filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702005#action_12702005
 ] 

Pradeep Kamath commented on PIG-627:


All the work till now (phase 1 and phase 2) has now been committed to trunk. A 
tag (pre-multiquery-phase2) was created prior to committing the multi-query work 
since this is a significantly big patch. The tag will serve as a baseline to trace 
down regressions.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-20 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700925#action_12700925
 ] 

Pradeep Kamath commented on PIG-627:


reviewed error_handling_0416.patch for additional changes per comment: 
https://issues.apache.org/jira/browse/PIG-627?focusedCommentId=1260page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_1260.
 +1, committed after removing the javadoc related changes which were already 
committed in the previous commit.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-14 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698857#action_12698857
 ] 

Pradeep Kamath commented on PIG-627:


+1, Patch committed, thanks Gunther!

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
 multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-07 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696861#action_12696861
 ] 

Olga Natkovich commented on PIG-627:


patch reviewed and committed; thanks, Gunther.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
 multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-06 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696350#action_12696350
 ] 

Pradeep Kamath commented on PIG-627:


+1, patch committed. Thanks for the contribution Gunther!

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-01 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694859#action_12694859
 ] 

Pradeep Kamath commented on PIG-627:


+1, patch committed - thanks for the contribution Gunther.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-24 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688957#action_12688957
 ] 

Pradeep Kamath commented on PIG-627:


+1 - committed patch by Gunther to merge changes in trunk to multiquery branch 
- thanks for the contribution Gunther.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688339#action_12688339
 ] 

Pradeep Kamath commented on PIG-627:


Comments for Richard's patch - multiquery-phase2_0313.patch

In MultiQueryOptimizer:
- What about the case where the MR job is not map-only and has an MR splittee - 
is this not handled for now?
- Are the single mapper case and the single map-reduce case the ones where the 
script has an explicit store 'file' and load 'file'? If so, then in 
mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is 
removed - shouldn't the store remain?
- There is common code in mergeOnlyMapperSplittee() and 
mergeOnlyMapReduceSplittee() which should be moved to a function to reduce the 
code duplication.

Just want to confirm that the multi-query optimization is only for map-reduce 
mode, since the optimizer is being called in MapReduceLauncher.

In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I 
noticed that in POSplit, it causes an exception - I think it should return the 
error, which would later be caught in the map() or reduce(). A test to make 
sure errors do get caught and cause failures would be good.
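
As a rough illustration of the behaviour requested above, the following sketch shows a split-like operator that hands an error status back to its caller, with the failure raised only in a stand-in for map(). Everything here (StatusSketch, ForwardingSplit, runMap, the Status values) is a hypothetical miniature written for this note and is not Pig's POSplit, POStatus, or Result API.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical miniature of a status-returning operator pipeline; not Pig's classes. */
class StatusSketch {
    enum Status { OK, ERR, EOP }

    static final class Result {
        final Status status;
        final Object value;

        Result(Status status, Object value) {
            this.status = status;
            this.value = value;
        }
    }

    interface Operator {
        Result getNext();
    }

    /** Source operator that replays a fixed list of results, then signals end of input. */
    static Operator fromList(List<Result> results) {
        Iterator<Result> it = results.iterator();
        return () -> it.hasNext() ? it.next() : new Result(Status.EOP, null);
    }

    /** Split-like operator: forwards whatever its input produced, including errors. */
    static final class ForwardingSplit implements Operator {
        private final Operator input;
        int forwarded = 0;

        ForwardingSplit(Operator input) {
            this.input = input;
        }

        @Override
        public Result getNext() {
            Result r = input.getNext();
            if (r.status == Status.OK) {
                forwarded++;
            }
            // An ERR result is returned as-is rather than thrown from here.
            return r;
        }
    }

    /** Stand-in for map(): the one place where an ERR status becomes a failure. */
    static int runMap(Operator root) throws IOException {
        int processed = 0;
        while (true) {
            Result r = root.getNext();
            if (r.status == Status.EOP) {
                return processed;
            }
            if (r.status == Status.ERR) {
                throw new IOException("Operator returned an error: " + r.value);
            }
            processed++;
        }
    }
}
```

A negative test in this style feeds the pipeline a source that yields an ERR result and asserts that runMap fails instead of returning a count.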

spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of 
ReverseDependencyWalker.

The following comment in BinStorage needs to be clarified:
{noformat}
if (!FileLocalizer.fileExists(fileName, storage)) {
    // At compile time in batch mode, the file may not exist
    // (such as intermediate file). Just return null - the
    // same way as we could's get a valid record from the input.
    // -- does this actually mean the same way as we would if we
    // did not get a valid record?
    return null;
}
{noformat}


 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
 multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688356#action_12688356
 ] 

Pradeep Kamath commented on PIG-627:


+1 on Gunther's patch - multiquery_explain_fix.patch. Patch has been committed 
to the multiquery branch - thanks for the fix Gunther!

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
 multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688461#action_12688461
 ] 

Pradeep Kamath commented on PIG-627:


+1 on Richard's patch -  multiquery-phase2_0323.patch, patch committed to 
multiquery branch - thanks for the contribution Richard.

A general comment for the multiquery work is to introduce some negative test 
cases (which return POStatus.STATUS_ERR from some operator in the map or reduce 
plan affected by the MultiQueryOptimizer) at some point.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680925#action_12680925
 ] 

Richard Ding commented on PIG-627:
--

The multiquery_0306.patch is the right one and doesn't need to be regenerated. 
Pradeep will review it today. 

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680997#action_12680997
 ] 

Pradeep Kamath commented on PIG-627:


Sorry about the misunderstanding, I think I looked at a different patch. After 
reviewing the right patch, here are some comments:

The patch throws Java Exceptions like IllegalStateException. This should be 
replaced with the appropriate Exception class (like MRCompilerException) as 
specified in 
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification. The 
exception should be created with the error code, error source and error message 
constructor. New error codes should be introduced if one of the existing ones 
in 
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification#head-9f71d78d362c3307711f98ec9db3ee12b55e92f6
 cannot be used. If new codes are introduced, the wiki table should be updated.
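
To make the constructor shape described above concrete, here is a minimal sketch of an exception that carries an error code and an error source alongside its message. The class SketchCompilerException, the source constants and their values, and the error code 9001 are all placeholders invented for this example; the real class names and codes should be taken from the error handling specification linked above.

```java
/** Hypothetical stand-in for a Pig compiler exception; not the real class. */
class SketchCompilerException extends Exception {
    // Assumed source categories, with values chosen only for illustration.
    static final byte SOURCE_INPUT = 2;
    static final byte SOURCE_BUG = 4;

    private final int errorCode;
    private final byte errorSource;

    SketchCompilerException(String message, int errorCode, byte errorSource) {
        super(message);
        this.errorCode = errorCode;
        this.errorSource = errorSource;
    }

    int getErrorCode() {
        return errorCode;
    }

    byte getErrorSource() {
        return errorSource;
    }
}

class PlanCheckSketch {
    // Placeholder code; a real one would be registered in the wiki table.
    static final int ERR_UNEXPECTED_LEAF_COUNT = 9001;

    /** Reports an internal-state problem with a typed exception instead of IllegalStateException. */
    static void checkSingleLeaf(int leafCount) throws SketchCompilerException {
        if (leafCount != 1) {
            throw new SketchCompilerException(
                    "Expected exactly one leaf in the plan, found " + leafCount,
                    ERR_UNEXPECTED_LEAF_COUNT,
                    SketchCompilerException.SOURCE_BUG);
        }
    }
}
```

Callers can then branch on the code and source rather than parsing the message text.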

The following can be used to check for file existence in 
BinStorage.determineSchema() - only in the case where the file does not exist, 
null should be returned
{code}
public static boolean fileExists(String filename, DataStorage store)
        throws IOException {
    ElementDescriptor elem = store.asElement(filename);
    return elem.exists() || globMatchesFiles(elem, store);
}
 {code}   

Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I 
wonder if we should have a ReverseDependencyOrderWalker since that is what the 
rootsFirst == false case will be. If we are not visiting roots to leaf, we 
really are not visiting in a dependency order - so the meaning of dependency 
order is no longer honored - this can be confusing I think. By explicitly 
naming the walker ReverseDependencyOrderWalker, the intent of walking from 
leaves to roots is more clear I think.
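
The difference between the two visiting orders can be sketched on a toy graph. The WalkOrderSketch class below is a hypothetical illustration, not Pig's OperatorPlan or PlanWalker API: dependencyOrder() visits every node after all of its predecessors (roots first), and reverseDependencyOrder() visits every node before its predecessors (leaves first).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy DAG used only to contrast the two walk orders. */
class WalkOrderSketch {
    // Maps each node to its predecessors (the nodes it depends on).
    private final Map<String, List<String>> preds = new LinkedHashMap<>();

    void addNode(String node) {
        preds.computeIfAbsent(node, k -> new ArrayList<>());
    }

    void connect(String from, String to) {
        addNode(from);
        addNode(to);
        preds.get(to).add(from);
    }

    /** Roots first: every node appears after all of its predecessors. */
    List<String> dependencyOrder() {
        List<String> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String node : preds.keySet()) {
            visit(node, seen, out);
        }
        return out;
    }

    private void visit(String node, Set<String> seen, List<String> out) {
        if (!seen.add(node)) {
            return; // already emitted
        }
        for (String p : preds.get(node)) {
            visit(p, seen, out);
        }
        out.add(node);
    }

    /** Leaves first: every node appears before all of its predecessors. */
    List<String> reverseDependencyOrder() {
        List<String> out = dependencyOrder();
        Collections.reverse(out);
        return out;
    }
}
```

For the script in the issue description (load, filter, two stores, one group), the forward walk starts at the load and the reverse walk starts at a store, which is the distinction a separately named walker makes explicit.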

In POSplit currently there is a PhysicalPlan representing the merged inner 
plans (where all plans are mutually exclusive) and there is also a 
List<PhysicalPlan> which has the same information in the form of a List. In the 
rest of pig code, inner plans have always been modelled as List<PhysicalPlan>. 
For consistency, it is better to just have a List<PhysicalPlan> to represent 
the inner plans.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681066#action_12681066
 ] 

Richard Ding commented on PIG-627:
--

Thanks for reviewing the patch. These are excellent suggestions. I'll make sure 
that the changes you proposed will be included in the next patch.

On exceptions,  when do we use runtime exceptions? I'm trying to use runtime 
exceptions to indicate programming errors such as precondition violations or 
internal state errors. 

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681085#action_12681085
 ] 

Pradeep Kamath commented on PIG-627:


Committed patch per previous comment that the review comments will be addressed 
in the next patch - thanks Richard for the contribution. 

In general from Pig code we always want to throw known PigExceptions even for 
programming errors or internal state errors - in these cases, we just use the 
source of the Exception as PigException.BUG. RuntimeException should be used 
when we want to throw an exception in a function which cannot throw any 
exceptions (like in methods from Hadoop API which we are implementing which do 
not throw any Exception).
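
The RuntimeException case can be sketched with any interface method whose signature declares no checked exceptions. The example below uses java.util.Comparator from the JDK as a stand-in for such a Hadoop callback; RawKeyComparatorSketch and decodeKey are hypothetical names made up for this illustration.

```java
import java.io.IOException;
import java.util.Comparator;

/** Hypothetical comparator whose interface method cannot declare checked exceptions. */
class RawKeyComparatorSketch implements Comparator<byte[]> {

    /** Assumed helper that may fail with a checked exception while decoding a key. */
    static int decodeKey(byte[] raw) throws IOException {
        if (raw == null || raw.length == 0) {
            throw new IOException("Empty key");
        }
        return raw[0];
    }

    @Override
    public int compare(byte[] left, byte[] right) {
        try {
            return Integer.compare(decodeKey(left), decodeKey(right));
        } catch (IOException e) {
            // compare() cannot declare IOException, so wrap it and keep the cause.
            throw new RuntimeException("Unable to compare keys", e);
        }
    }
}
```

Keeping the original exception as the cause means the underlying failure is still visible to whoever eventually handles the RuntimeException.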

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-10 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680724#action_12680724
 ] 

Pradeep Kamath commented on PIG-627:


multiquery_0306.patch seems to have a lot of code from the earlier patch ( 
multi-store-0304.patch). Richard, can you svn up your code base and regenerate 
the patch with only the changes you intended?

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-06 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679768#action_12679768
 ] 

Olga Natkovich commented on PIG-627:


I am reviewing this patch.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-06 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679791#action_12679791
 ] 

Olga Natkovich commented on PIG-627:


Looks like the patch has been committed but I will add my 2 cents anyways:

(1) Looks like the test cases only test for success or failure but not for the 
correctness of results.
(2) I was not quite sure why we need to executeBatch in Grunt for every dfs 
command. We treat those commands differently from pig commands anyways.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-06 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12679798#action_12679798
 ] 

Richard Ding commented on PIG-627:
--

This patch is for the multi query branch.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-05 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12679500#action_12679500
 ] 

Gunther Hagleitner commented on PIG-627:


Oh. I also took out the restriction of the openIterator in batch mode. That was 
no longer needed.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-04 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12679021#action_12679021
 ] 

Pradeep Kamath commented on PIG-627:


I committed multi-store-0304.patch into the multi-query branch after 
reviewing the changes.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multi-store-0303.patch, multi-store-0304.patch, 
 multiquery_0223.patch, multiquery_0224.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-02-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676507#action_12676507
 ] 

Olga Natkovich commented on PIG-627:


I committed the latest patch. Ran the unit tests and they all passed.

Couple of issues that need to be addressed:

(1) PigServer.openIterator, in batch mode, always returns an empty iterator. 
That will not work if a script has a dump in it.
(2) PigServer.getStorePlan assumes that each alias maps to a single store. In 
the case of multiple queries that might not be true.

Thanks Richard and Gunther for your contribution!

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multiquery_0223.patch, multiquery_0224.patch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-02-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672464#action_12672464
 ] 

Alan Gates commented on PIG-627:


See http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification for a more 
concrete proposal that incorporates and extends the thoughts in the last 
comment.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch





[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-01-20 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665669#action_12665669
 ] 

Alan Gates commented on PIG-627:


I propose to implement this as follows.

Currently split works by dumping all of its input to disk, and then starting MR 
jobs for each of its outputs.  So if you have a script like:

{code}
A = load ...
split A into B, beta ...
C = filter B ...
D = group C ...
E = foreach D ...
store E
gamma = filter beta ...
delta = group gamma ...
epsilon = foreach delta ...
store epsilon
{code}

then A will be loaded and immediately stored by the split.  This output will 
then be loaded before C, and run to the store E.  Then the output will again be
loaded for gamma and run to epsilon.

If instead split was changed to have inner plans like foreach, then the above 
could be executed as follows: A is loaded and the input passed to split.  Each 
tuple split received would be run through a pipeline that contained C and a 
separate pipeline that contained gamma.  Separate map reduce jobs would then be 
started, one to handle D-E and one to handle delta-epsilon.  This turns 3 reads 
of the data into one plus two partials (depending on how selective the two 
filters are).
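
To make the single-read idea concrete, here is a minimal Python sketch. It is 
not Pig code: CountingSource, split_with_inner_plans and filter_plan are 
hypothetical names invented for this illustration, and the two predicates 
merely stand in for the filters C and gamma.

```python
class CountingSource:
    """Hypothetical input that counts how many rows are read from it."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.rows_read = 0

    def scan(self):
        for row in self.rows:
            self.rows_read += 1
            yield row


def filter_plan(predicate):
    """Build an inner plan that emits the row if it passes the predicate."""
    return lambda row: [row] if predicate(row) else []


def split_with_inner_plans(source, inner_plans):
    """Scan the source once and push every tuple through each inner plan."""
    outputs = [[] for _ in inner_plans]
    for row in source.scan():
        for out, plan in zip(outputs, inner_plans):
            out.extend(plan(row))
    return outputs


source = CountingSource([(1, "x", 10), (7, "y", 20), (9, "x", 30), (3, "z", 40)])
c_rows, gamma_rows = split_with_inner_plans(
    source,
    [
        filter_plan(lambda r: r[0] > 5),     # stands in for C = filter B ...
        filter_plan(lambda r: r[1] == "x"),  # stands in for gamma = filter beta ...
    ],
)
```

Here source.rows_read ends up equal to the number of input rows, because both 
filters see each tuple during the same scan. In the proposal, the two filtered 
outputs would then feed the separate jobs for D-E and delta-epsilon.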

The relevance to the current issue is that queries like:

{code}
A = load ..
B = filter A ...
store B ...
C = group B ...
D = foreach C ...
store D;
{code}

would be implicitly converted to:

{code}
A = load ..
B = filter A ...
split B into B1, B2;
store B1 ...
C = group B2 ...
D = foreach C ...
store D;
{code}

Changes needed to accomplish this:
 * Add an optimization pass that takes a plan with splits and rearranges it to 
be contained within the splits plus any subsequent MR jobs.  This may need to 
be split up between the logical to physical translator and the MR compiler.  It 
also needs to be able to handle diamonds in the plan, where split data comes 
back together, either as part of the same MR job or in a later job.
 * Implement a split operator that can contain inner plans.  This is basically 
a foreach without a generate, and so hopefully much of the code from foreach 
could be shared or at least stolen.  It will be somewhat different in that it 
will be able to contain any non-MR boundary forcing task (filter, foreach, 
dump, store) and not be able to contain distinct or order by.
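
As a rough sketch of the boundary rule in the second bullet, the Python below 
cuts one branch of a split at the first operator that cannot live inside an 
inner plan. It is an illustration only, not Pig's optimizer: partition_branch 
and INNER_PLAN_OPS are hypothetical names, the operators are plain strings, 
and treating every operator outside the listed set (including group) as a 
boundary is an assumption made here.

```python
# Operators the proposal lists as allowed inside a split's inner plan.
INNER_PLAN_OPS = frozenset({"filter", "foreach", "dump", "store"})


def partition_branch(ops):
    """Return (inner_plan, later_job) for one branch of a split.

    The branch is cut at the first operator that is not in INNER_PLAN_OPS;
    that operator and everything after it is left for a subsequent MR job.
    """
    for i, op in enumerate(ops):
        if op not in INNER_PLAN_OPS:
            return list(ops[:i]), list(ops[i:])
    return list(ops), []


# Branch C-D-E from the first example: the filter stays in the split,
# group and everything after it goes to a later job.
inner, later = partition_branch(["filter", "group", "foreach", "store"])

# Branch B1 from the converted script: a bare store fits in the split.
store_inner, store_later = partition_branch(["store"])
```

With these inputs the filter ends up in the inner plan while group, foreach 
and store are deferred, which matches the description above of a pipeline 
containing C followed by a separate job handling D-E.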




 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

