[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-627:
    Fix Version/s: 0.3.0

PERFORMANCE: multi-query optimization
    Key: PIG-627
    URL: https://issues.apache.org/jira/browse/PIG-627
    Project: Pig
    Issue Type: Improvement
    Affects Versions: 0.2.0
    Reporter: Olga Natkovich
    Fix For: 0.3.0
    Attachments: doc-fix.patch, error_handling_0415.patch, error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch

Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance:

    A = load 'data' as (a, b, c);
    B = filter A by a > 5;
    store B into 'output1';
    C = group B by b;
    store C into 'output2';

This script will result in a map-only job that generates output1, followed by a map-reduce job that generates output2. As a result, the data is read, parsed, and filtered twice, which is unnecessary and costly.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
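To make the cost described in the issue concrete, here is a minimal Python sketch that counts how often each alias of the example script is evaluated. It is only a toy model: the `PLAN` and `STORES` tables and both cost functions are hypothetical names invented for illustration, not part of Pig.

```python
# Toy model of the example script; not Pig code.
# Each alias maps to the single alias it reads from (None for the load).
PLAN = {"A": None, "B": "A", "C": "B"}  # A = load, B = filter A, C = group B
STORES = ["B", "C"]                     # store B into 'output1'; store C into 'output2'


def lineage(alias):
    """Return the chain of aliases needed to compute `alias`, source first."""
    chain = []
    while alias is not None:
        chain.append(alias)
        alias = PLAN[alias]
    return chain[::-1]


def independent_cost():
    """Every store re-runs its whole lineage, as the issue describes."""
    return sum(len(lineage(store)) for store in STORES)


def shared_cost():
    """Each alias is evaluated once and reused by every store that needs it."""
    needed = set()
    for store in STORES:
        needed.update(lineage(store))
    return len(needed)


print(independent_cost(), shared_cost())  # prints "5 3"
```

In this model the load and the filter each run twice when the stores are treated as independent queries, which is the duplicated work the issue wants to eliminate.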
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Richard Ding updated PIG-627:
    Attachment: multiquery-phase3_0423.patch

This patch completes the phase 3 development, which merges multiple map-reduce splittees into a splitter. As an example, the Pig script

{code}
A = load ...
split A into B, beta ...
C = filter B ...
D = group C ...
E = foreach D ...
store E
gamma = filter beta ...
delta = group gamma ...
epsilon = foreach delta ...
store epsilon
{code}

discussed earlier in this bug now results in a single map-reduce job.

This patch is for the multi query branch.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: error_handling_0416.patch

Fixed some issues with the error handling patch (0415):
* Duplicated error code 2129
* Unclear string splitter
* Added native exception message to error msg in store operator.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: doc-fix.patch

Javadoc changes only: doc-fix.patch contains fixes to silence javadoc warnings.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: error_handling_0415.patch

This patch contains:
* Error codes/msg
* Javadoc changes
* Fix for the merge error in the parser (aliases cmd)
* Updated golden files
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: streaming-fix.patch

Some fixes in the patch streaming-fix.patch:
* The split operator wasn't always playing nicely with the way we run the pipeline one extra time in the mapper's or reducer's close function if there's a stream operator present.
* Moved the MR optimizer that sets stream in map and stream in reduce to the end of the queue.
* PhyPlanVisitor forgets to pop some walkers it pushed on the stack. That can result in the NoopFilterRemoval stage failing, because it's looking in the wrong plan.
* Setting the jobname by default to the scriptname came in through the last merge, but no longer worked.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: merge-041409.patch

merge-041409.patch contains the latest merge from trunk to branch.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: merge_trunk_to_branch.patch

Merge latest trunk changes to branch.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: non_reversible_store_load_dependencies_2.patch

Same as above plus:
* Fix for explain when a script has execution points inside. Like:

{{{
a = load ...
...
store a
exec;
b = load ...
...
}}}

This will run explain once for each execution block.
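The idea of "one explain per execution block" can be pictured as cutting the script at every bare `exec` statement. The Python sketch below is a simplified illustration under that assumption; `execution_blocks` is a hypothetical helper, not a function from the Pig code base, and the `...` placeholders stand in for the elided parts of the example.

```python
def execution_blocks(statements):
    """Split a list of script statements into blocks separated by bare `exec`."""
    blocks, current = [], []
    for stmt in statements:
        if stmt.strip().rstrip(";").strip() == "exec":
            # An execution point closes the current block.
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(stmt)
    if current:
        blocks.append(current)
    return blocks


script = ["a = load ...", "store a ...", "exec;", "b = load ...", "store b ..."]
blocks = execution_blocks(script)
for number, block in enumerate(blocks, start=1):
    print(f"explain block {number}: {block}")
```

With the five placeholder statements above, the sketch yields two blocks, so explain would be invoked twice.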
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: noop_filter_absolute_path_flag.patch

This patch contains three items:
- Removes the noop stores as described above
- Makes load and store paths absolute and canonical
- Introduces a flag that turns multiquery on and off (default is off)
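As an illustration of what "absolute and canonical" means for a load or store path, the following Python sketch resolves a path against a working directory and collapses redundant components, so that two spellings of the same location compare equal. That motivation is an assumption, the `canonical` helper and the `/user/pig` directory are hypothetical, and URI schemes such as `hdfs://` are deliberately not handled.

```python
import posixpath


def canonical(path, cwd):
    """Resolve `path` against `cwd` and collapse '.', '..' and duplicate slashes."""
    # posixpath.join keeps `path` unchanged when it is already absolute.
    return posixpath.normpath(posixpath.join(cwd, path))


# '/user/pig' is a made-up working directory used only for this example.
print(canonical("out/../tmp/./foo", "/user/pig"))  # /user/pig/tmp/foo
print(canonical("/data//in", "/user/pig"))         # /data/in
```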
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: merge_741727_HEAD__0324_2.patch

Seems like the last merge patch didn't correctly contain the entire new TestFinish.java file. Well, this one does.

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: 1.0.0
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: multiquery_explain_fix.patch

Fixes three issues with explain:
a) This is not a bug. Splits in interactive mode still need this branch.
b) explain needs to discard batch iff it was loading a script
c) Split is now a nested operator (and explain needs to know)

This patch doesn't overlap any files with Richard's last patch.

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: 1.0.0
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Richard Ding updated PIG-627:
    Attachment: multiquery-phase2_0313.patch

This patch completes the phase 2 development as specified in the document http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification:
# Allow multiple stores in single job
# Merge multiple plans into the split operator
# Terminate all but one with stores

This patch is for the multi query branch.

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: 1.0.0
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Richard Ding updated PIG-627:
    Attachment: multiquery_0306.patch

This patch contains the enhanced split operator to support multi-store queries. It introduces a new MROperPlan adjuster that merges a single-load, mapper-only MapReduceOper into its predecessor based on the (implicit) split boundary. The goal is to reduce the total number of MR jobs for a given multi-query task.

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: types_branch
    Fix For: types_branch
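The effect of such an adjuster can be sketched in Python as follows. This is a rough model under stated assumptions, not the patch's implementation: the `Job` class and `merge_map_only` function are hypothetical, jobs are assumed to be listed in dependency order, and a temporary file is kept as long as any job that was not merged still reads it.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for one map-reduce job in the plan."""
    name: str
    loads: list
    stores: list
    has_reduce: bool = False
    merged: list = field(default_factory=list)  # names of jobs folded into this one


def merge_map_only(jobs):
    """Fold each single-load, map-only job into the job that wrote its input."""
    producer = {path: job for job in jobs for path in job.stores}
    kept, merged_paths, still_read = [], set(), set()
    for job in jobs:
        source = producer.get(job.loads[0]) if len(job.loads) == 1 else None
        if source is not None and not job.has_reduce:
            # The predecessor now writes this job's outputs itself.
            source.stores.extend(job.stores)
            source.merged.append(job.name)
            merged_paths.add(job.loads[0])
            for path in job.stores:
                producer[path] = source
        else:
            kept.append(job)
            still_read.update(job.loads)
    # Drop temporary files that no remaining job reads.
    for path in merged_paths - still_read:
        producer[path].stores.remove(path)
    return kept


jobs = [
    Job("split", loads=["data"], stores=["tmp"]),
    Job("store1", loads=["tmp"], stores=["output1"]),
    Job("group", loads=["tmp"], stores=["output2"], has_reduce=True),
]
kept = merge_map_only(jobs)
print([(job.name, job.stores, job.merged) for job in kept])
```

In this example the three jobs shrink to two: the map-only `store1` job disappears into `split`, while `group` stays separate because it has a reduce phase and still reads the temporary file.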
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: file_cmds-0305.patch

This patch is for the multi query branch again. It mostly fixes the problem with certain commands in the script that require immediate execution (in batch mode). So if you do stuff like:

    ...
    store a into 'tmp_foo';
    ...
    rm tmp_foo
    ...

The rm will trigger execution and the file will be there for you to delete, copyToLocal, move, etc.

You can also use the exec statement without params in a script now, to force execution of what we've seen so far.

This patch also contains a minor fix with the computation of progress in MR jobs (which I screwed up in the last patch).

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: types_branch
    Fix For: types_branch
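The behaviour described here, queuing stores in batch mode and running them as soon as a file command or a bare `exec` appears, can be sketched in Python. The `Batch` class is a hypothetical toy model, and its `TRIGGERS` set is an illustrative guess rather than the actual list of commands handled by the patch.

```python
class Batch:
    """Toy batch executor: stores are queued, certain commands flush the queue."""

    # Illustrative set only; the real list of triggering commands may differ.
    TRIGGERS = {"rm", "mv", "cp", "copyToLocal", "exec"}

    def __init__(self):
        self.pending = []  # stores seen but not yet executed
        self.log = []      # what actually ran, in order

    def submit(self, statement):
        command = statement.split()[0]
        if command == "store":
            self.pending.append(statement)
        elif command in self.TRIGGERS:
            self.flush()  # make sure earlier stores have produced their files
            if command != "exec":
                self.log.append(statement)
        # Anything else (alias definitions) only extends the plan in this model.

    def flush(self):
        if self.pending:
            self.log.append("run: " + "; ".join(self.pending))
            self.pending.clear()


batch = Batch()
for line in ["a = load 'data'", "store a into 'tmp_foo'", "rm tmp_foo",
             "store b into 'out'", "exec"]:
    batch.submit(line)
print(batch.log)
```

The log shows the store on `tmp_foo` running before the `rm`, so the file exists when the command needs it, and the final `exec` forces the remaining store to run.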
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: multi-store-0304.patch

Same as the other one except:
- Documented the createStoreFunction method some more.
- Removed unnecessary fields in the path parsing
- Moved tear down of stores below extra streaming run (in PigMapBase's, PigMapReduce's close function)

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: types_branch
    Fix For: types_branch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: multi-store-0303.patch

This patch introduces the functionality to support multiple stores in a single MR job. It's for the multiquery branch and it is needed to unblock concurrent dev on the split operator.

There aren't enough unit tests in this patch yet. They will be provided once the split operator can use multi stores (right now, nothing actually uses these stores, so testing is difficult).

To test the patch, I temporarily turned multi store on for all queries (even if they only have one store) and then ran all the unit tests. All tests passed.

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: types_branch
    Fix For: types_branch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: multiquery_0224.patch

This patch includes the multiquery unit test cases.

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: types_branch
    Fix For: types_branch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
Gunther Hagleitner updated PIG-627:
    Attachment: multiquery_0223.patch

This is for the multiquery branch. It's phase 1. It contains a lot of infrastructure work needed to look at entire scripts during evaluation (batch mode). It will look at a script plan and insert splits whenever there is a shared sequence of operations. The split execution is still the same as it was before (load-store bridge).

PERFORMANCE: multi-query optimization
    Key: PIG-627
    Affects Versions: types_branch
    Fix For: types_branch
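Inserting a split wherever a sequence of operations is shared can be pictured on a small plan graph. The Python sketch below is a simplified illustration, not the patch's code: `insert_splits` is a hypothetical helper and the plan is written as a plain dictionary from each operator to its consumers.

```python
def insert_splits(successors):
    """Return a copy of the plan in which every operator that feeds more than
    one consumer feeds a single split node instead."""
    result = {}
    for op, outs in successors.items():
        if len(outs) > 1:
            split = f"split({op})"
            result[op] = [split]
            result[split] = list(outs)
        else:
            result[op] = list(outs)
    return result


# Plan for the example script in the issue description.
plan = {
    "A": ["B"],
    "B": ["store output1", "C"],
    "C": ["store output2"],
    "store output1": [],
    "store output2": [],
}
new_plan = insert_splits(plan)
print(new_plan["B"], new_plan["split(B)"])
```

Here `B` is the end of the shared sequence (load and filter), so the split lands directly after it and both the first store and the group read from the split.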