[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-06-18 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-627:
---

Fix Version/s: 0.3.0

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Fix For: 0.3.0

 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1, followed by a 
 map-reduce job that generates output2. As a result, data is read, parsed and 
 filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-627:
-

Attachment: multiquery-phase3_0423.patch

This patch completes the phase 3 development, which merges multiple map-reduce 
splittees into a splitter.

As an example, the Pig script

{code}
A = load ...
split A into B, beta ...
C = filter B ...
D = group C ...
E = foreach D ...
store E
gamma = filter beta ...
delta = group gamma ...
epsilon = foreach delta ...
store epsilon
{code}

discussed earlier in this bug now results in a single map-reduce job.

This patch is for the multi query branch.
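
One rough way to picture how two filter/group/foreach pipelines can share a single map-reduce job is to tag every intermediate record with the index of the branch it belongs to. The Python sketch below is a toy model under that assumption, not the code in the patch; `run_single_job` and the `(keep, key_of, summarize)` triples are hypothetical names made up for illustration.

```python
from collections import defaultdict

def run_single_job(records, branches):
    """Run several filter/group/foreach pipelines in one map-reduce style pass.

    Each branch is a (keep, key_of, summarize) triple of plain functions.
    """
    # "Map" side: read the input once and tag each output with its branch index.
    shuffled = defaultdict(list)
    for record in records:
        for index, (keep, key_of, _) in enumerate(branches):
            if keep(record):
                shuffled[(index, key_of(record))].append(record)

    # "Reduce" side: use the tag to route each group back to its own branch.
    outputs = [dict() for _ in branches]
    for (index, key), group in shuffled.items():
        summarize = branches[index][2]
        outputs[index][key] = summarize(group)
    return outputs

records = [(1, "x"), (2, "y"), (3, "x"), (4, "y")]
branches = [
    # Stands in for B -> C -> D -> E: count the rows per key.
    (lambda r: r[0] > 1, lambda r: r[1], len),
    # Stands in for beta -> gamma -> delta -> epsilon: sum the first field per key.
    (lambda r: r[0] < 4, lambda r: r[1], lambda g: sum(r[0] for r in g)),
]
first, second = run_single_job(records, branches)
```

The input is scanned once and both outputs come out of the same shuffle, which is the saving the single-job plan is after.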

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-16 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: error_handling_0416.patch

Fixed some issues with the error handling patch (0415):

   * Duplicated error code 2129
   * Unclear string splitter
   * Added native exception message to error msg in store operator.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-15 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: doc-fix.patch

javadoc changes only. doc-fix.patch contains fixes to silence javadoc 
warnings.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, file_cmds-0305.patch, 
 fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-15 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: error_handling_0415.patch

This patch contains:

   * Error codes/msg
   * Javadoc changes
   * fix the merge error in parser (aliases cmd)
   * updated golden files

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
 multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: streaming-fix.patch

Some fixes in the patch streaming-fix.patch:

   * The split operator wasn't always playing nicely with the way we run the 
pipeline one extra time in the mapper's or reducer's close function if there's 
a stream operator present
   * Moved the MR optimizer that sets stream in map and stream in reduce to 
the end of the queue.
   * PhyPlanVisitor forgets to pop some walkers it pushed on the stack. That 
can result in the NoopFilterRemoval stage failing, because it's looking in the 
wrong plan.
   * Setting the jobname by default to the scriptname came in through the last 
merge, but no longer worked
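
The walker problem in the third item is a general push/pop discipline issue. As a minimal sketch (in Python rather than Pig's Java, and with a hypothetical `Visitor` class that only loosely mirrors the push/pop idea), pairing every push with a guaranteed pop keeps later stages looking at the right plan:

```python
from contextlib import contextmanager

class Visitor:
    """Toy visitor: the top of the walker stack is the plan being walked."""

    def __init__(self, plan):
        self.walkers = [plan]

    @property
    def current_plan(self):
        return self.walkers[-1]

    @contextmanager
    def walking(self, inner_plan):
        # Push a walker for the nested plan ...
        self.walkers.append(inner_plan)
        try:
            yield
        finally:
            # ... and always pop it, so the outer plan is restored
            # even if visiting the inner plan raises.
            self.walkers.pop()

visitor = Visitor("outer")
with visitor.walking("inner"):
    inside = visitor.current_plan
after = visitor.current_plan
```

If the pop were skipped, `after` would still be the inner plan, which is the kind of "looking in the wrong plan" failure described above.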

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
 multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge-041409.patch

merge-041409.patch contains the latest merge from trunk to branch.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-07 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge_trunk_to_branch.patch

Merge latest trunk changes to branch

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
 multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-04 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: non_reversible_store_load_dependencies_2.patch

Same as above plus:

   * Fix for explain when a script has execution points inside. 

Like:

{{{
a = load ...
...
store a
exec;
b = load ...
...
}}}

This will run explain once for each execution block.


 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-30 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: noop_filter_absolute_path_flag.patch

This patch contains three items:

- Removes the noop stores as described above
- Makes load and store paths absolute and canonical
- Introduces a flag that turns multiquery on and off (default is off)

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 noop_filter_absolute_path_flag.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-24 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge_741727_HEAD__0324_2.patch

Seems like the last merge patch didn't correctly contain the entire new 
TestFinish.java file. Well, this one does.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-19 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multiquery_explain_fix.patch

Fixes three issues with explain:

a) This is not a bug. Splits in interactive mode still need this branch.
b) explain needs to discard batch iff it was loading a script
c) Split is now a nested operator (and explain needs to know)

This patch doesn't overlap any files with Richard's last patch.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
 multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-627:
-

Attachment: multiquery-phase2_0313.patch

 This patch completes the phase 2 development as specified in the document 
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification:
   # Allow multiple stores in single job
   # Merge multiple plans into the split operator
   # Terminate all but one with stores

This patch is for the multi query branch.




 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
 multiquery_0224.patch, multiquery_0306.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-06 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-627:
-

Attachment: multiquery_0306.patch

This patch contains the enhanced split operator to support multi-store queries. 
It introduces a new MROperPlan adjuster that merges each single-load, mapper-only 
MapReduceOper into its predecessor based on the (implicit) split boundary. The 
goal is to reduce the total number of MR jobs for a given multi-query task.
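
To make the merging rule concrete, here is a small Python sketch of the idea. It is a toy model only: the `Job` class, its fields and `merge_map_only` are hypothetical stand-ins for illustration, not the MROperPlan or MapReduceOper classes touched by the patch.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    loads: list      # input paths
    stores: list     # output paths
    map_only: bool   # True when the job has no reduce phase
    merged: list = field(default_factory=list)  # names of jobs folded into this one

def merge_map_only(jobs):
    """Fold each single-load, map-only job into the job that writes its input.

    Jobs are assumed to arrive in dependency order. Surviving jobs are
    modified in place and returned; the shared intermediate store is kept.
    """
    producer = {}   # path -> job that currently writes it
    kept = []
    for job in jobs:
        pred = producer.get(job.loads[0]) if len(job.loads) == 1 else None
        if job.map_only and pred is not None:
            # The predecessor now writes this job's outputs as well.
            pred.stores.extend(job.stores)
            pred.merged.append(job.name)
            target = pred
        else:
            kept.append(job)
            target = job
        for path in job.stores:
            producer[path] = target
    return kept

jobs = [
    Job("splitter", ["data"], ["tmp"], map_only=True),
    Job("store_b", ["tmp"], ["output1"], map_only=True),
    Job("group_job", ["tmp"], ["output2"], map_only=False),
]
kept = merge_map_only(jobs)
```

In this example three jobs become two: the map-only store is absorbed by the splitter, while the job with a reduce phase stays separate.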

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-05 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: file_cmds-0305.patch

This patch is for the multi query branch again. It mostly fixes the problem 
with certain commands in the script that require immediate execution (in batch 
mode).

So if you do stuff like:

...
store a into 'tmp_foo';
...
rm tmp_foo
...

The rm will trigger execution and the file will be there for you to delete, 
copyToLocal, move, etc. You can also use the exec statement without params in 
a script now, to force execution of what we've seen so far.
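
The behaviour can be pictured with a toy batch runner, sketched here in Python. Everything in it is an assumption for illustration: `BatchRunner`, its `IMMEDIATE` command list and the log format are hypothetical and do not come from the patch.

```python
class BatchRunner:
    """Toy model: queue statements, run the queue when a command needs real files."""

    # Illustrative list of commands that force execution of the batch so far.
    IMMEDIATE = ("rm", "mv", "cp", "copyToLocal", "exec")

    def __init__(self):
        self.pending = []   # statements seen but not yet executed
        self.log = []       # what ran, in order

    def submit(self, statement):
        command = statement.split()[0]
        if command in self.IMMEDIATE:
            # Run everything queued so far, so the files exist on disk.
            self.flush()
            if command != "exec":
                self.log.append(("fs", statement))
        else:
            self.pending.append(statement)

    def flush(self):
        if self.pending:
            self.log.append(("job", list(self.pending)))
            self.pending.clear()

runner = BatchRunner()
for line in ["a = load 'in'", "store a into 'tmp_foo'", "rm tmp_foo", "b = load 'in2'"]:
    runner.submit(line)
```

Here the `rm` forces the two queued statements to run first, and the later `load` simply starts a new batch.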

This patch also contains a minor fix with the computation of progress in MR 
jobs (which I screwed up in the last patch).



 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-04 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multi-store-0304.patch

Same as the other one except: 

- Documented the createStoreFunction method some more.
- Removed unnecessary fields in the path parsing
- Moved tear down of stores below extra streaming run (in PigMapBase's, 
PigMapReduce's close function)

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multi-store-0303.patch, multi-store-0304.patch, 
 multiquery_0223.patch, multiquery_0224.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-03 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multi-store-0303.patch

This patch introduces the functionality to support multiple stores in a single 
MR job. It's for the multiquery branch and it is needed to unblock concurrent 
dev on the split operator.

There aren't enough unit tests in this patch yet. They will be provided once 
the split operator can use multi stores (right now, nothing actually uses these 
stores, so testing is difficult). In order to test the patch, I temporarily 
turned multi store on for all queries (even if they only have one store) and 
then ran all the unit tests. All tests passed.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multi-store-0303.patch, multiquery_0223.patch, 
 multiquery_0224.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-02-24 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multiquery_0224.patch

This patch includes the multiquery unit test cases.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multiquery_0223.patch, multiquery_0224.patch





[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-02-23 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multiquery_0223.patch

This is for the multiquery branch. It's phase 1. It contains a lot of 
infrastructural work to be able to look at entire scripts during evaluation 
(batch mode). It will look at a script plan and insert splits whenever there is 
a shared sequence of operations. The split execution is still the same as it 
was before (load-store bridge).
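
As an illustration of what inserting splits at shared operations means, the Python sketch below models a script plan as a mapping from each alias to the alias it reads from. This is a simplified assumption, not the plan representation used in the patch; `insert_splits` and the `split_<alias>` node names are hypothetical.

```python
from collections import defaultdict

def insert_splits(plan):
    """plan maps each alias to the alias it reads from (None for a load).

    Returns a new plan in which every alias that feeds more than one
    consumer is followed by an explicit 'split_<alias>' node.
    """
    consumers = defaultdict(list)
    for alias, source in plan.items():
        if source is not None:
            consumers[source].append(alias)

    new_plan = dict(plan)
    for source, users in consumers.items():
        if len(users) > 1:
            split = f"split_{source}"
            new_plan[split] = source
            # Every consumer now reads from the split instead of the source.
            for user in users:
                new_plan[user] = split
    return new_plan

# The script from the issue description: B feeds both a store and a group.
plan = {
    "A": None,               # load 'data'
    "B": "A",                # filter A
    "store_output1": "B",    # store B into 'output1'
    "C": "B",                # group B by b
    "store_output2": "C",    # store C into 'output2'
}
optimized = insert_splits(plan)
```

Only `B` has two consumers, so a single split is inserted after it and the load and filter are shared by both stores.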

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multiquery_0223.patch

