Re: [jira] Updated: (PIG-1309) Map-side Cogroup
Condition (1) refers only to explicit (user-specified) statements, right? Not the implicit project introduced by Pig to conform to the schema?

Regards,
Mridul

On Saturday 21 August 2010 12:59 AM, Ashutosh Chauhan (JIRA) wrote:

[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Release Note:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and cogroup statements.
2) Data must be sorted on the join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

was:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and join statements.
2) Data must be sorted on the join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.7.0, 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch

In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely on the map side in Pig. This jira is to add a map-side implementation of Cogroup in Pig.
Details to follow.
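For readers unfamiliar with the idea, the merge-based cogroup described in the release note can be illustrated with a small sketch. This is not Pig's implementation (that lives in POMergeCogroup); it is a minimal Python illustration that assumes every input is already sorted ascending on the key with nulls first. The names `merge_cogroup` and `null_first` are made up for the example, and it treats a null key like any other key, which may differ from how Pig groups nulls coming from different relations.

```python
from itertools import groupby


def null_first(key):
    # Ordering helper (hypothetical): None sorts before every non-null key.
    return (key is not None, key)


def merge_cogroup(*inputs, key=lambda rec: rec[0]):
    """Yield (k, [bag_0, bag_1, ...]) from inputs pre-sorted ascending by key.

    Each input is scanned once and never re-sorted, which is why the
    sortedness preconditions matter.
    """
    # One iterator of (key, run-of-records-with-that-key) per input.
    runs = [groupby(inp, key=key) for inp in inputs]
    heads = [next(r, None) for r in runs]
    while any(h is not None for h in heads):
        # The next output key is the smallest key at the head of any input.
        smallest = min((h[0] for h in heads if h is not None), key=null_first)
        bags = []
        for i, h in enumerate(heads):
            if h is not None and h[0] == smallest:
                bags.append(list(h[1]))         # consume the run first...
                heads[i] = next(runs[i], None)  # ...then advance this input
            else:
                bags.append([])                 # this input has no such key
        yield smallest, bags


if __name__ == "__main__":
    A = [(None, "a0"), (1, "a1"), (1, "a2"), (3, "a3")]
    B = [(1, "b1"), (2, "b2"), (3, "b3")]
    for k, bags in merge_cogroup(A, B):
        print(k, bags)
```

If an input is not sorted as assumed, the sketch silently emits the same key more than once, which shows why sortedness is stated as a precondition rather than something checked during the merge.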
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1309:

Release Note:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and one of the loaders implements the {{CollectableLoader}} interface. The primary algorithm is based on sort-merge join. Additional implementation details:
1) No other operations can be done between the load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side cogroups (PIG-1309) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.7.0, 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Release Note:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and join statements.
2) Data must be sorted on the join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

was:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and one of the loaders implements the {{CollectableLoader}} interface. The primary algorithm is based on sort-merge join. Additional implementation details:
1) No other operations can be done between the load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the CollectableLoader interface as well as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side cogroups (PIG-1309) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.7.0, 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Release Note:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and cogroup statements.
2) Data must be sorted on the join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

was:
With this patch, it is now possible to perform map-side cogroup if the data is sorted and the loaders implement certain interfaces. The primary algorithm is based on sort-merge join, with additional restrictions. The following preconditions must be met to use this feature:
1) No other operations can be done between the load and join statements.
2) Data must be sorted on the join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null keys, they should occur before anything else.
4) Left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.

Example:
    A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
    C = COGROUP A by id, B by id using 'merge';

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.7.0, 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
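Preconditions 2 and 3 (ascending order on the key, nulls before everything else) can be checked on a sample of keys before relying on 'merge'. The following is only a sketch of such a check in Python; the function name `is_sorted_nulls_first` is hypothetical and nothing like it is claimed to exist in Pig or Zebra.

```python
def is_sorted_nulls_first(keys):
    """Return True if keys are ascending, with all None keys at the front."""
    # Map each key to a tuple so that None compares lower than any value
    # and is never compared directly against a non-null key.
    ordering = [(k is not None, k) for k in keys]
    return all(a <= b for a, b in zip(ordering, ordering[1:]))


if __name__ == "__main__":
    print(is_sorted_nulls_first([None, None, 1, 2, 2]))  # nulls first, ascending
    print(is_sorted_nulls_first([1, None, 2]))           # null after a real key
```

Duplicate keys are allowed by this check, since a cogroup collects every record that shares a key into the same bag.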
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1309:

Fix Version/s: 0.7.0

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.7.0, 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Status: Resolved (was: Patch Available)
Resolution: Fixed

Patch checked in to the 0.7 branch as well.

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Attachment: PIG_1309_7.patch

Backport of merge cogroup for the 0.7 branch. Since Hudson can test only trunk, I ran all the tests manually; all passed.

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1309:

Fix Version/s: 0.8.0

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Attachment: pig-1309_2.patch

Updated the patch to fix test failures and javac warnings, and to add more comments.

Result of test-patch on latest patch:
{noformat}
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]
{noformat}

Result of test-commit:
{noformat}
test-commit:
[mkdir] Created dir: /homes/chauhana/scratch/latest/build/test/logs
[junit] Running org.apache.pig.test.TestAdd
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.036 sec
:
:
[junit] Running org.apache.pig.test.TestTypeCheckingValidatorNoSchema
[junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.165 sec

BUILD SUCCESSFUL
{noformat}

Patch checked in to trunk.

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Attachment: (was: pig-1309.patch)

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Attachments: mapsideCogrp.patch, pig-1309_1.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Attachment: pig-1309_1.patch

Getting closer. Running it through Hudson to find out if it breaks anything.

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Attachments: mapsideCogrp.patch, pig-1309_1.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Status: Patch Available (was: Open)

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Attachments: mapsideCogrp.patch, pig-1309_1.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Attachment: pig-1309.patch

Did an offline review with Alan. Found a subtle bug in POMergeCogroup#getNext(). Fixed that and added more tests. Still need to tidy things up in a few places. Looking for suggestions for better test cases that cover all the edge cases.

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Attachments: mapsideCogrp.patch, pig-1309.patch
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:

Attachment: mapsideCogrp.patch

Preliminary patch to discuss the approach. Not ready for inclusion yet.

Map-side Cogroup

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Attachments: mapsideCogrp.patch