[Pig Wiki] Update of JoinFramework by OlgaN

2008-10-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

--
  
  Note that `GROUP` can take advantage of this knowledge as well.
  
- [Discussion of different data layout options.]
+ To support this type of joing the data can be layed out in 2 ways. First, the 
data is globally sorted on the join key and range index is available. Second, 
the data is staticly partitioned in a fixed number of buckets using the same 
partitioning function. The first approach is more flexible since it allows 
arbitrary level of parallelism for processing the data but it is more complex 
and expensive to generate the data. Also, an open question for the second 
approach is how to identify matching partitions; using file name seems like a 
pretty fragile approach.
  
  === Fragment Replicate Join (FRJ) ===
  


[Pig Wiki] Update of JoinFramework by OlgaN

2008-10-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

New page:
= Join Framework =

== Objective == 

This document provides a comprehensive view of performing joins in Pig. By 
=JOIN= here we mean traditional inner/outer =SQL= joins which in Pig are 
realized via =COGROUP= followed by =flatten= of the relations.

Some of the approaches described in this document can also be applied to 
=CROSS= and =GROUP= as well. 

== Joins ==

Currently, Pig running on top of Hadoop executes all joins in the same way. 
During the map stage, the data from each relation is annotated with the index 
of that relation. Then, the data is sorted and partitioned by the join key and 
provided to the reducer. This is similar to SQL's =hash join=. In the next 
generation Pig (currently on types branch), the data from the same relation is 
guaranteed to be continuous for the same key. This is to allow optimization 
that only keep =N-1= relations in memory. (Unfortunately, we did not see the 
expected speedup when this optimization was tried - investigation is still in 
progress.)

In some situations, more efficient join implementations can be constructed if 
more is known about the data of the relations. They are described in the 
section.

=== Pre-partitioned Join (PPJ) === 

This join type takes advantage of the fact that the data of all relations is 
already partition by the join key or its prefix which means that the join can 
be done completely independently on separate nodes. It further helps if the 
data is sorted on the key; otherwise it might have to get sorted before the 
join.

In the case of =Hadoop=, this means that the join can be done in a =Map= 
avoiding =SORT/SHUFFLE/REDUCE= stages. The performance would be even better if 
the partitions for the same key ranges were collocated on the same nodes and if 
the computation was scheduled to run on this nodes. However, for now this is 
outside of Pig's control.

Note that GROUP can take advantage of this knowledge as well.

[Discussion of different data layout options.]

=== Fragment Replicate Join (FRJ) === 

This join type takes advantage of the fact that N-1 relations in the join are 
very small and can fit into main memory of each node. In this case, the small 
tables can be copied onto all the nodes and be joined with the data from the 
larger table. This saves the cost of sorting and partitioning the large table. 
For Hadoop this means that the join can happen on the map side. 

The data coming out of the join is not guaranteed to be sorted on the join key 
which could cause problems for queries that follow join by =GROUP= or =ORDER 
BY= on the prefix of the join key. This should be taken into account when 
choosing join type.

If you have several larger tables in the join that can't fit into memory, it 
might be beneficial to split the join to fit FRJ pattern since it would 
significantly reduce the size of the data going into the next join and might 
even allow to use FRJ again.

Note that CROSS can take advantage of this approach as well.

=== Indexed Join (IJ) ===

This join type takes advantage of the fact that one or more tables 
participating in the join have index on the group by key or its prefix. This is 
similar in structure to FRJ join but could be even more efficient since 
processing time can be proportional to the size of the non-indexed and 
hopefully smaller table.

In Hadoop, this will also result in a map side join. 

Currently neither Pig nor Hadoop have indexing structure. So getting to this 
point might take some time and needs some compelling use cases to make the 
investment.


[Pig Wiki] Update of JoinFramework by OlgaN

2008-10-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

--
  = Join Framework =
  
- == Objective == 
+ == Objective ==
  
- This document provides a comprehensive view of performing joins in Pig. By 
=JOIN= here we mean traditional inner/outer =SQL= joins which in Pig are 
realized via =COGROUP= followed by =flatten= of the relations.
+ This document provides a comprehensive view of performing joins in Pig. By 
=JOIN= here we mean traditional inner/outer SQL joins which in Pig are realized 
via =COGROUP= followed by flatten of the relations.
  
- Some of the approaches described in this document can also be applied to 
=CROSS= and =GROUP= as well. 
+ Some of the approaches described in this document can also be applied to 
CROSS and GROUP as well. 
  
  == Joins ==
  
- Currently, Pig running on top of Hadoop executes all joins in the same way. 
During the map stage, the data from each relation is annotated with the index 
of that relation. Then, the data is sorted and partitioned by the join key and 
provided to the reducer. This is similar to SQL's =hash join=. In the next 
generation Pig (currently on types branch), the data from the same relation is 
guaranteed to be continuous for the same key. This is to allow optimization 
that only keep =N-1= relations in memory. (Unfortunately, we did not see the 
expected speedup when this optimization was tried - investigation is still in 
progress.)
+ Currently, Pig running on top of Hadoop executes all joins in the same way. 
During the map stage, the data from each relation is annotated with the index 
of that relation. Then, the data is sorted and partitioned by the join key and 
provided to the reducer. This is similar to SQL's hash join. In the next 
generation Pig (currently on types branch), the data from the same relation is 
guaranteed to be continuous for the same key. This is to allow optimization 
that only keep N-1 relations in memory. (Unfortunately, we did not see the 
expected speedup when this optimization was tried - investigation is still in 
progress.)
  
  In some situations, more efficient join implementations can be constructed if 
more is known about the data of the relations. They are described in the 
section.
  
- === Pre-partitioned Join (PPJ) === 
+ === Pre-partitioned Join (PPJ) ===
  
  This join type takes advantage of the fact that the data of all relations is 
already partition by the join key or its prefix which means that the join can 
be done completely independently on separate nodes. It further helps if the 
data is sorted on the key; otherwise it might have to get sorted before the 
join.
  
- In the case of =Hadoop=, this means that the join can be done in a =Map= 
avoiding =SORT/SHUFFLE/REDUCE= stages. The performance would be even better if 
the partitions for the same key ranges were collocated on the same nodes and if 
the computation was scheduled to run on this nodes. However, for now this is 
outside of Pig's control.
+ In the case of Hadoop, this means that the join can be done in a Map avoiding 
SORT/SHUFFLE/REDUCE stages. The performance would be even better if the 
partitions for the same key ranges were collocated on the same nodes and if the 
computation was scheduled to run on this nodes. However, for now this is 
outside of Pig's control.
  
  Note that GROUP can take advantage of this knowledge as well.
  
  [Discussion of different data layout options.]
  
- === Fragment Replicate Join (FRJ) === 
+ === Fragment Replicate Join (FRJ) ===
  
  This join type takes advantage of the fact that N-1 relations in the join are 
very small and can fit into main memory of each node. In this case, the small 
tables can be copied onto all the nodes and be joined with the data from the 
larger table. This saves the cost of sorting and partitioning the large table. 
  For Hadoop this means that the join can happen on the map side. 
  
- The data coming out of the join is not guaranteed to be sorted on the join 
key which could cause problems for queries that follow join by =GROUP= or 
=ORDER BY= on the prefix of the join key. This should be taken into account 
when choosing join type.
+ The data coming out of the join is not guaranteed to be sorted on the join 
key which could cause problems for queries that follow join by GROUP or ORDER 
BY on the prefix of the join key. This should be taken into account when 
choosing join type.
  
  If you have several larger tables in the join that can't fit into memory, it 
might be beneficial to split the join to fit FRJ pattern since it would 
significantly reduce the size of the data going into the next join and might 
even allow to use FRJ again.