[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1353:


Summary: Map-side outer joins  (was: Map-side joins)

 Map-side outer joins
 

 Key: PIG-1353
 URL: https://issues.apache.org/jira/browse/PIG-1353
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: pig-1353.patch, pig-1353.patch


 Pig already has a couple of map-side join implementations: Merge Join and 
 Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
 Join can only join two tables, and only as an inner join. FR Join 
 can join multiple relations, but it too can only do inner and left outer 
 joins. Further, it restricts the sizes of side relations. It would be nice if 
 we could do map-side joins on multiple tables as well as do inner, left outer, 
 right outer and full outer joins. 
 A lot of groundwork for this has already been done in PIG-1309. The remainder 
 will be tracked in this jira.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1353:


Release Note: 
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables as well as inner joins on more than two tables in Pig on the map side 
if data is sorted and one of the loaders implements the {{CollectableLoader}} 
interface. The primary algorithm is based on sort-merge join. 

Additional implementation details:
1) No other operations can be done between load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null 
keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well as 
OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.   
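
Conditions 2) and 3) above can be checked on a sample of join keys before 
attempting the join. The following is a minimal Python sketch, not part of Pig: 
the function name is_sorted_nulls_first is hypothetical, and null keys are 
modeled as None.

```python
from typing import Iterable, Optional


def is_sorted_nulls_first(keys: Iterable[Optional[int]]) -> bool:
    """Return True if keys are ascending and all None (null) keys come first."""
    first = True
    prev = None
    for key in keys:
        if not first:
            if key is None and prev is not None:
                # A null key after a non-null key breaks "nulls sort first".
                return False
            if key is not None and prev is not None and key < prev:
                # A descending step breaks ascending order.
                return False
        prev = key
        first = False
    return True
```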

Note that the Zebra loader satisfies all of these conditions, so it can be used 
out of the box.
Similar conditions apply to map-side cogroups (PIG-1309) as well.

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by id left, B by id using 'merge';







[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

Release Note: 
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables as well as inner joins on more than two tables in Pig on the map side 
if data is sorted and the loaders implement the required interfaces. The primary 
algorithm is based on sort-merge join. 

The following preconditions should be met in order to use this feature:
1) No other operations can be done between load and join statements.
2) Data must be sorted on join keys in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null 
keys, they should occur before anything else.
4) The left-most loader must implement the {CollectableLoader} interface as well as 
{OrderedLoadFunc}.
5) All other loaders must implement {IndexableLoadFunc}.   
6) Type information must be provided in schema for all the loaders. 

Note that the Zebra loader satisfies all of these conditions, so it can be used 
out of the box.

Similar conditions apply to map-side cogroups (PIG-1309) as well.  
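
To illustrate why the inputs must be sorted ascending with nulls first, here is 
a minimal Python sketch of a textbook sort-merge join over two inputs. It is not 
Pig's implementation and does not model the loader interfaces. The names 
merge_join, _groups and _order are hypothetical, rows are (key, value) pairs, 
null keys are modeled as None, and the sketch assumes SQL-style semantics in 
which null keys never match each other.

```python
from itertools import groupby


def _order(key):
    # None (null) orders before every non-null key.
    return (key is not None, key if key is not None else 0)


def _groups(rows):
    # Collapse a sorted stream of (key, value) rows into (key, [values]) runs.
    for key, run in groupby(rows, key=lambda row: row[0]):
        yield key, [value for _, value in run]


def merge_join(left, right, how="inner"):
    """Yield (key, left_value, right_value) from two inputs sorted on key."""
    if how not in ("inner", "left", "right", "full"):
        raise ValueError("unknown join type: " + how)
    keep_left = how in ("left", "full")
    keep_right = how in ("right", "full")
    lg, rg = _groups(left), _groups(right)
    lcur, rcur = next(lg, None), next(rg, None)
    while lcur is not None or rcur is not None:
        if rcur is None or (lcur is not None and _order(lcur[0]) < _order(rcur[0])):
            # Left key has no partner on the right.
            if keep_left:
                for lv in lcur[1]:
                    yield (lcur[0], lv, None)
            lcur = next(lg, None)
        elif lcur is None or _order(rcur[0]) < _order(lcur[0]):
            # Right key has no partner on the left.
            if keep_right:
                for rv in rcur[1]:
                    yield (rcur[0], None, rv)
            rcur = next(rg, None)
        else:
            key = lcur[0]
            if key is None:
                # Assumption: null keys never match; emit them as unmatched.
                if keep_left:
                    for lv in lcur[1]:
                        yield (None, lv, None)
                if keep_right:
                    for rv in rcur[1]:
                        yield (None, None, rv)
            else:
                # Equal keys: emit the cross product of the two runs.
                for lv in lcur[1]:
                    for rv in rcur[1]:
                        yield (key, lv, rv)
            lcur, rcur = next(lg, None), next(rg, None)
```

Because each input is read once in key order, no shuffle or reduce-side 
grouping is needed in this sketch; the only buffered state is the current run 
of equal keys on each side.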

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
C = join A by id left, B by id using 'merge';



