Author: olga
Date: Fri Sep  3 21:22:39 2010
New Revision: 992477

URL: http://svn.apache.org/viewvc?rev=992477&view=rev
Log:
PIG-1600: Docs update (chandec via olgan)

Modified:
    hadoop/pig/trunk/CHANGES.txt
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tutorial.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml

Modified: hadoop/pig/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=992477&r1=992476&r2=992477&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Fri Sep  3 21:22:39 2010
@@ -36,6 +36,8 @@ PIG-1249: Safe-guards against misconfigu
 
 IMPROVEMENTS
 
+PIG-1600: Docs update (chandec via olgan)
+
 PIG-1585: Add new properties to help and documentation(olgan)
 
 PIG-1399: Filter expression optimizations (yanz via gates)

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=992477&r1=992476&r2=992477&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml 
(original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml Fri 
Sep  3 21:22:39 2010
@@ -259,7 +259,7 @@ B = GROUP A BY t PARALLEL 18;
 
 <p>In this example all the MapReduce jobs that get launched use 20 
reducers.</p>
 <source>
-SET DEFAULT_PARALLEL 20;
+SET default_parallel 20;
 A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
 B = GROUP A BY t;
 C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
@@ -292,11 +292,11 @@ C = limit B 500;
 </section>
 
 <section>
-<title>Prefer DISTINCT over GROUP BY - GENERATE</title>
+<title>Prefer DISTINCT over GROUP BY/GENERATE</title>
 
-<p>When it comes to extracting the unique values from a column in a relation, 
one of two approaches can be used: </p>
+<p>To extract unique values from a column in a relation you can use DISTINCT 
or GROUP BY/GENERATE. DISTINCT is the preferred method; it is faster and more 
efficient.</p>
 
-<p>Example Using GROUP BY - GENERATE</p>
+<p>Example using GROUP BY - GENERATE:</p>
 
 <source>
 A = load 'myfile' as (t, u, v);
@@ -306,7 +306,7 @@ D = foreach C generate group as uniqueke
 dump D; 
 </source>
 
-<p>Example Using DISTINCT</p>
+<p>Example using DISTINCT:</p>
 
 <source>
 A = load 'myfile' as (t, u, v);
@@ -315,8 +315,6 @@ C = distinct B;
 dump C; 
 </source>
 
-<p>In pig 0.1.x, DISTINCT is just GROUP BY/PROJECT under the hood. In pig 
0.2.0 it is not, and it is much faster and more efficient (depending on your 
key cardinality, up to 20x faster in pig team's tests). Therefore, the use of 
DISTINCT is recommended over GROUP BY - GENERATE.  </p>
-
 </section>
 </section>
 </body>

Modified: 
hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml
URL: 
http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml?rev=992477&r1=992476&r2=992477&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml 
(original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml 
Fri Sep  3 21:22:39 2010
@@ -32,6 +32,8 @@
    <p>Also, be sure to review the information in the <a 
href="cookbook.html">Pig Cookbook</a>. </p>
     </section>
     
+  <!-- ==================================================================== -->
+    
    <!-- PIG LATIN STATEMENTS -->
    <section>
        <title>Pig Latin Statements</title>     
@@ -62,8 +64,7 @@
     <li>Either interactively or in batch </li>
    </ul>
    
-   
-   <p></p>
+ <p></p>
 <p>Note that Pig now uses Hadoop's local mode (rather than Pig's native local 
mode).</p>   
 <p>A few run examples are shown here; see <a href="setup.html">Pig Setup</a> 
for more examples.</p>   
    
@@ -126,8 +127,6 @@ DUMP B;
    
    <p> </p>
    <p>See <a href="#Multi-Query+Execution">Multi-Query Execution</a> for more 
information on how Pig Latin statements are processed.</p> 
-   
-   
    </section>
    
    <section>
@@ -233,8 +232,20 @@ grunt> DUMP C;
 </section>  
 <!-- END PIG LATIN STATEMENTS -->
 
- 
+<!-- ================================================================== -->
+<!-- MEMORY MANAGEMENT -->
+<section>
+<title>Memory Management</title>
 
+<p>Pig allocates a fixed amount of memory to store bags and spills to disk as 
soon as the memory limit is reached. This is very similar to how Hadoop decides 
when to spill data accumulated by the combiner. </p>
+
+<p>The amount of memory allocated to bags is determined by 
pig.cachedbag.memusage; the default is set to 10% of available memory. Note 
that this memory is shared across all large bags used by the application.</p>
+
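+<p>For example, you can raise the limit for a single run by overriding the 
+property on the command line (a sketch; 0.2 here means 20% of available 
+memory, and the right value depends on your workload):</p>
+<source>
+pig -Dpig.cachedbag.memusage=0.2 myscript.pig
+</source>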
+</section> 
+<!-- END MEMORY MANAGEMENT  -->
+
+
+<!-- ==================================================================== -->
 <!-- MULTI-QUERY EXECUTION-->
 <section>
 <title>Multi-Query Execution</title>
@@ -378,7 +389,6 @@ $ pig -F myscript.pig
 or
 $ pig -stop_on_failure myscript.pig
 </source>
-
 </section>
 
 <section>
@@ -497,25 +507,189 @@ Ftab = group ....
 Gtab = .... aggregation function
 STORE Gtab INTO '/user/vxj/finalresult2';
 </source>
+</section>
+</section>
+
+</section>
+<!-- END MULTI-QUERY EXECUTION-->
 
 
+<!-- ==================================================================== -->
+ <!-- NULLS -->
+ <section>
+<title>Null Values</title>
+<p>Pig handles null values differently for the GROUP/COGROUP and JOIN 
operations.</p>
+ 
+ <section>
+<title>GROUP/COGROUP and Null Values</title>
+
+<p>When using the GROUP operator with a single relation, records with a null 
group key are grouped together.</p>
+<source>
+a = load 'student' as (name:chararray, age:int, gpa:float);
+dump a;
+(joe,18,2.5)
+(sam,,3.0)
+(bob,,3.5)
+
+x = group a by age;
+dump x;
+(18,{(joe,18,2.5)})
+(,{(sam,,3.0),(bob,,3.5)})
+</source>
+  
+<p>When using the GROUP (COGROUP) operator with multiple relations, records 
with a null group key are considered different and are grouped separately. 
+In the example below note that there are two tuples in the output 
corresponding to the null group key: 
+one that contains tuples from relation A (but not relation B) and one that 
contains tuples from relation B (but not relation A).</p>
+
+<source>
+A = load 'student' as (name:chararray, age:int, gpa:float);
+B = load 'student' as (name:chararray, age:int, gpa:float);
+dump B;
+(joe,18,2.5)
+(sam,,3.0)
+(bob,,3.5)
+
+X = cogroup A by age, B by age;
+dump X;
+(18,{(joe,18,2.5)},{(joe,18,2.5)})
+(,{(sam,,3.0),(bob,,3.5)},{})
+(,{},{(sam,,3.0),(bob,,3.5)})
+</source>
 </section>
+ 
+ <section>
+<title>JOIN and Null Values</title>
+<p>When performing inner joins, the JOIN operator adheres to the SQL standard 
and disregards (filters out) records with null join keys.</p>
+ <source>
+A = load 'student' as (name:chararray, age:int, gpa:float);
+B = load 'student' as (name:chararray, age:int, gpa:float);
+dump B;
+(joe,18,2.5)
+(sam,,3.0)
+(bob,,3.5)
+  
+X = join A by age, B by age;
+dump X;
+(joe,18,2.5,joe,18,2.5)
+</source>
 </section>
 
+ </section>
+ <!-- END NULLS -->
+
+<!-- ==================================================================== -->
+ 
+ <!-- OPTIMIZATION RULES -->
+<section>
+<title>Optimization Rules</title>
+<p>Pig supports various optimization rules. By default, optimization and all 
optimization rules are turned on. 
+To turn off optimization, use:</p>
+
+<source>
+pig -optimizer_off [opt_rule | all ]
+</source>
+
+<p>Note that some rules are mandatory and cannot be turned off.</p>
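+
+<p>For example, to run a script with only the PushUpFilters rule turned off 
+(a sketch of the command line; the other rules remain enabled):</p>
+<source>
+pig -optimizer_off PushUpFilters myscript.pig
+</source>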
+
+<section>
+<title>ImplicitSplitInserter</title>
+<p>Status: Mandatory</p>
+<p>
+<a href="piglatin_ref2.html#SPLIT">SPLIT</a> is the only operator that models 
multiple outputs in Pig. 
+To ease the process of building logical plans, all operators are allowed to 
have multiple outputs. As part of the 
+optimization, all non-split operators that have multiple outputs are altered 
to have a SPLIT operator as the output 
+and the outputs of the operator are then made outputs of the SPLIT operator. 
An example will illustrate the point. 
+Here, a split will be inserted after the LOAD and the split outputs will be 
connected to the FILTER (b) and the COGROUP (c).
+</p>
+<source>
+A = LOAD 'input';
+B = FILTER A BY $1 == 1;
+C = COGROUP A BY $0, B BY $0;
+</source>
 </section>
-<!-- END MULTI-QUERY EXECUTION-->
 
+<section>
+<title>TypeCastInserter</title>
+<p>Status: Mandatory</p>
+<p>
+If you specify a <a href="piglatin_ref2.html#Schemas">schema</a> with the 
+<a href="piglatin_ref2.html#LOAD">LOAD</a> statement, the optimizer will 
perform a pre-fix projection of the columns 
+and <a href="piglatin_ref2.html#Cast+Operators">cast</a> the columns to the 
appropriate types. An example will illustrate the point. 
+The LOAD statement (a) has a schema associated with it. The optimizer will 
insert a FOREACH operator that will project columns 0, 1 and 2 
+and also cast them to chararray, int and float respectively. 
+</p>
+<source>
+A = LOAD 'input' AS (name: chararray, age: int, gpa: float);
+B = FILTER A BY $1 == 1;
+C = GROUP A BY $0;
+</source>
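+
+<p>Conceptually, the inserted FOREACH behaves as if the script contained an 
+explicit projection and cast (a sketch of the equivalent statement; the alias 
+A1 is hypothetical):</p>
+<source>
+A1 = FOREACH A GENERATE (chararray)$0 AS name, (int)$1 AS age, (float)$2 AS gpa;
+</source>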
+</section>
+
+<section>
+<title>StreamOptimizer</title>
+<p>
+Optimize when <a href="piglatin_ref2.html#LOAD">LOAD</a> precedes <a 
href="piglatin_ref2.html#STREAM">STREAM</a> 
+and the loader class is the same as the serializer for the stream. Similarly, 
optimize when STREAM is followed by 
+<a href="piglatin_ref2.html#STORE">STORE</a> and the deserializer class is 
same as the storage class. 
+For both of these cases the optimization is to replace the loader/serializer 
with BinaryStorage which just moves bytes 
+around and to replace the storer/deserializer with BinaryStorage.
+</p>
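+<p>A sketch of a script that qualifies for this optimization, assuming the 
+same storage function is used on both sides of the stream:</p>
+<source>
+A = LOAD 'input' USING PigStorage();
+B = STREAM A THROUGH `cat`;
+STORE B INTO 'output' USING PigStorage();
+</source>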
+
+</section>
 
+<section>
+<title>OpLimitOptimizer</title>
+<p>
+The objective of this rule is to push the <a 
href="piglatin_ref2.html#LIMIT">LIMIT</a> operator up the data flow graph 
+(or down the tree for database folks). In addition, for top-k (ORDER BY 
followed by a LIMIT) the LIMIT is pushed into the ORDER BY.
+</p>
+<source>
+A = LOAD 'input';
+B = ORDER A BY $0;
+C = LIMIT B 10;
+</source>
+</section>
+
+<section>
+<title>PushUpFilters</title>
+<p>
+The objective of this rule is to push the <a 
href="piglatin_ref2.html#FILTER">FILTER</a> operators up the data flow graph. 
+As a result, the number of records that flow through the pipeline is reduced. 
+</p>
+<source>
+A = LOAD 'input';
+B = GROUP A BY $0;
+C = FILTER B BY $0 &lt; 10;
+</source>
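+
+<p>Conceptually, the plan is rewritten so that the filter runs before the 
+group, reducing the data that reaches the GROUP operator (a sketch of the 
+equivalent script):</p>
+<source>
+A = LOAD 'input';
+B = FILTER A BY $0 &lt; 10;
+C = GROUP B BY $0;
+</source>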
+</section>
+
+<section>
+<title>PushDownExplodes</title>
+<p>
+The objective of this rule is to reduce the number of records that flow 
through the pipeline by moving 
+<a href="piglatin_ref2.html#FOREACH">FOREACH</a> operators with a 
+<a href="piglatin_ref2.html#Flatten+Operator">FLATTEN</a> down the data flow 
graph. 
+In the example shown below, it would be more efficient to move the foreach 
after the join to reduce the cost of the join operation.
+</p>
+<source>
+A = LOAD 'input' AS (a, b, c);
+B = LOAD 'input2' AS (x, y, z);
+C = FOREACH A GENERATE FLATTEN($0), b, c;
+D = JOIN C BY $1, B BY $1;
+</source>
+</section>
+</section> <!-- END OPTIMIZATION RULES -->
+
+
+<!-- ==================================================================== -->
 
 <!-- SPECIALIZED JOINS-->
 <section>
 <title>Specialized Joins</title>
 <p>
-Pig Latin includes three "specialized" joins: replicated joins, skewed joins, 
and merge joins. </p>
-<ul>
-<li>Replicated, skewed, and merge joins can be performed using <a 
href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a>.</li>
-<li>Replicated and skewed joins can also be performed using <a 
href="piglatin_ref2.html#JOIN+%28outer%29">outer joins</a>.</li>
-</ul>
+In certain cases, the performance of <a 
href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a> and <a 
href="piglatin_ref2.html#JOIN+%28outer%29">outer joins</a>
+can be optimized using replicated, skewed, or merge joins. </p>
+
 
 <!-- FRAGMENT REPLICATE JOINS-->
 <section>
@@ -618,48 +792,61 @@ assumed to have been done (see the Condi
 Pig implements the merge join algorithm by selecting the left input of the 
join to be the input file for the map phase, 
 and the right input of the join to be the side file. It then samples records 
from the right input to build an
 index that contains, for each sampled record, the key(s), the filename, and the 
offset into the file the record 
- begins at. This sampling is done in an initial map only job. A second 
MapReduce job is then initiated, 
+ begins at. This sampling is done in the first MapReduce job. A second 
MapReduce job is then initiated, 
  with the left input as its input. Each map uses the index to seek to the 
appropriate record in the right 
  input and begin doing the join. 
 </p>
 
 <section>
 <title>Usage</title>
-<p>Perform a merge join with the USING clause (see <a 
href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a>).</p>
+<p>Perform a merge join with the USING clause (see <a 
href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a> and <a 
href="piglatin_ref2.html#JOIN+%28outer%29">outer joins</a>). </p>
 <source>
-C = JOIN A BY a1, B BY b1 USING 'merge';
+D = JOIN A BY a1, B BY b1, C BY c1 USING 'merge';
 </source>
 </section>
 
 <section>
 <title>Conditions</title>
-<p>
-Merge join will only work under these conditions: 
-</p>
-
+<p><strong>Condition A</strong></p>
+<p>Inner merge join (between two tables) will only work under these 
conditions: </p>
 <ul>
-<li>Both inputs are sorted in *ascending* order of join keys. If an input 
consists of many files, there should be 
-a total ordering across the files in the *ascending order of file name*. So 
for example if one of the inputs to the 
-join is a directory called input1 with files a and b under it, the data should 
be sorted in ascending order of join 
-key when read starting at a and ending in b. Likewise if an input directory 
has part files part-00000, part-00001, 
-part-00002 and part-00003, the data should be sorted if the files are read in 
the sequence part-00000, part-00001, 
-part-00002 and part-00003. </li>
-<li>The merge join only has two inputs </li>
-<li>The loadfunc for the right input of the join should implement the 
OrderedLoadFunc interface (PigStorage does 
-implement the OrderedLoadFunc interface). </li>
-<li>Only inner join will be supported </li>
-
 <li>Between the load of the sorted input and the merge join statement there 
can only be filter statements and 
 foreach statements, where the foreach statements must meet the following 
conditions: 
 <ul>
-<li>There should be no UDFs in the foreach statement </li>
-<li>The foreach statement should not change the position of the join keys </li>
-<li>There should not transformation on the join keys which will change the 
sort order </li>
+<li>There should be no UDFs in the foreach statement. </li>
+<li>The foreach statement should not change the position of the join keys. 
</li>
+<li>There should be no transformation on the join keys which will change the 
sort order. </li>
 </ul>
 </li>
+<li>Data must be sorted on join keys in ascending (ASC) order on both 
sides.</li>
+<li>The right-side loader must implement either the OrderedLoadFunc interface 
or the IndexableLoadFunc interface.</li>
+<li>Type information must be provided for the join key in the schema.</li>
+</ul>
+<p></p>
+<p>The Zebra and PigStorage loaders satisfy all of these conditions.</p>
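+
+<p>A sketch of an inner merge join that meets these conditions, using 
+PigStorage over pre-sorted input (the file names and schemas here are 
+hypothetical):</p>
+<source>
+A = LOAD 'sorted1' USING PigStorage() AS (a1:int, a2);
+B = LOAD 'sorted2' USING PigStorage() AS (b1:int, b2);
+C = JOIN A BY a1, B BY b1 USING 'merge';
+</source>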
+<p></p>
 
+<p><strong>Condition B</strong></p>
+<p>Outer merge join (between two tables) and inner merge join (between three 
or more tables) will only work under these conditions: </p>
+<ul>
+<li>No other operations can be done between the load and join statements. </li>
+<li>Data must be sorted on join keys in ascending (ASC) order on both sides. 
</li>
+<li>The left-most loader must implement the CollectableLoadFunc interface as 
well as OrderedLoadFunc. </li>
+<li>All other loaders must implement IndexableLoadFunc. </li>
+<li>Type information must be provided for the join key in the schema.</li>
 </ul>
 <p></p>
+<p>The Zebra loader satisfies all of these conditions.</p>
+
+<p>An example of a left outer merge join using the Zebra loader:</p>
+<source>
+A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
+B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
+C = join A by id left, B by id using 'merge'; 
+</source>
+
+<p></p>
+<p><strong>Both Conditions</strong></p>
 <p>
 For optimal performance, each part file of the left (sorted) input of the join 
should have a size of at least 
 1 hdfs block size (for example, if the hdfs block size is 128 MB, each part 
file should be at least 128 MB). 
@@ -668,138 +855,22 @@ If the total input size (including all p
 job performing the merge-join will process. 
 </p>
 
-<p>
-In local mode, merge join will revert to regular join.
-</p>
 </section>
 </section><!-- END MERGE JOIN -->
 
-
 </section>
 <!-- END SPECIALIZED JOINS-->
- 
- <!-- OPTIMIZATION RULES -->
-<section>
-<title>Optimization Rules</title>
-<p>Pig supports various optimization rules. By default optimization, and all 
optimization rules, are turned on. 
-To turn off optimiztion, use:</p>
-
-<source>
-pig -optimizer_off [opt_rule | all ]
-</source>
 
-<p>Note that some rules are mandatory and cannot be turned off.</p>
-
-<section>
-<title>ImplicitSplitInserter</title>
-<p>Status: Mandatory</p>
-<p>
-<a href="piglatin_ref2.html#SPLIT">SPLIT</a> is the only operator that models 
multiple outputs in Pig. 
-To ease the process of building logical plans, all operators are allowed to 
have multiple outputs. As part of the 
-optimization, all non-split operators that have multiple outputs are altered 
to have a SPLIT operator as the output 
-and the outputs of the operator are then made outputs of the SPLIT operator. 
An example will illustrate the point. 
-Here, a split will be inserted after the LOAD and the split outputs will be 
connected to the FILTER (b) and the COGROUP (c).
-</p>
-<source>
-A = LOAD 'input';
-B = FILTER A BY $1 == 1;
-C = COGROUP A BY $0, B BY $0;
-</source>
-</section>
-
-<section>
-<title>TypeCastInserter</title>
-<p>Status: Mandatory</p>
-<p>
-If you specify a <a href="piglatin_ref2.html#Schemas">schema</a> with the 
-<a href="piglatin_ref2.html#LOAD">LOAD</a> statement, the optimizer will 
perform a pre-fix projection of the columns 
-and <a href="piglatin_ref2.html#Cast+Operators">cast</a> the columns to the 
appropriate types. An example will illustrate the point. 
-The LOAD statement (a) has a schema associated with it. The optimizer will 
insert a FOREACH operator that will project columns 0, 1 and 2 
-and also cast them to chararray, int and float respectively. 
-</p>
-<source>
-A = LOAD 'input' AS (name: chararray, age: int, gpa: float);
-B = FILER A BY $1 == 1;
-C = GROUP A By $0;
-</source>
-</section>
-
-<section>
-<title>StreamOptimizer</title>
-<p>
-Optimize when <a href="piglatin_ref2.html#LOAD">LOAD</a> precedes <a 
href="piglatin_ref2.html#STREAM">STREAM</a> 
-and the loader class is the same as the serializer for the stream. Similarly, 
optimize when STREAM is followed by 
-<a href="piglatin_ref2.html#STORE">STORE</a> and the deserializer class is 
same as the storage class. 
-For both of these cases the optimization is to replace the loader/serializer 
with BinaryStorage which just moves bytes 
-around and to replace the storer/deserializer with BinaryStorage.
-</p>
-
-</section>
-
-<section>
-<title>OpLimitOptimizer</title>
-<p>
-The objective of this rule is to push the <a 
href="piglatin_ref2.html#LIMIT">LIMIT</a> operator up the data flow graph 
-(or down the tree for database folks). In addition, for top-k (ORDER BY 
followed by a LIMIT) the LIMIT is pushed into the ORDER BY.
-</p>
-<source>
-A = LOAD 'input';
-B = ORDER A BY $0;
-C = LIMIT B 10;
-</source>
-</section>
-
-<section>
-<title>PushUpFilters</title>
-<p>
-The objective of this rule is to push the <a 
href="piglatin_ref2.html#FILTER">FILTER</a> operators up the data flow graph. 
-As a result, the number of records that flow through the pipeline is reduced. 
-</p>
-<source>
-A = LOAD 'input';
-B = GROUP A BY $0;
-C = FILTER B BY $0 &lt; 10;
-</source>
-</section>
-
-<section>
-<title>PushDownExplodes</title>
-<p>
-The objective of this rule is to reduce the number of records that flow 
through the pipeline by moving 
-<a href="piglatin_ref2.html#FOREACH">FOREACH</a> operators with a 
-<a href="piglatin_ref2.html#Flatten+Operator">FLATTEN</a> down the data flow 
graph. 
-In the example shown below, it would be more efficient to move the foreach 
after the join to reduce the cost of the join operation.
-</p>
-<source>
-A = LOAD 'input' AS (a, b, c);
-B = LOAD 'input2' AS (x, y, z);
-C = FOREACH A GENERATE FLATTEN($0), B, C;
-D = JOIN C BY $1, B BY $1;
-</source>
-</section>
-</section> <!-- END OPTIMIZATION RULES -->
-
- <!-- MEMORY MANAGEMENT -->
-<section>
-<title>Memory Management</title>
-
-<p>Pig allocates a fix amount of memory to store bags and spills to disk as 
soon as the memory limit is reached. This is very similar to how Hadoop decides 
when to spill data accumulated by the combiner. </p>
-
-<p>The amount of memory allocated to bags is determined by 
pig.cachedbag.memusage; the default is set to 10% of available memory. Note 
that this memory is shared across all large bags used by the application.</p>
-
-</section> 
-<!-- END MEMORY MANAGEMENT  -->
 
+<!-- ==================================================================== -->
  
-  <!-- ZEBRA INTEGRATION -->
+<!-- ZEBRA INTEGRATION -->
 <section>
 <title>Zebra Integration</title>
 <p>For information about how to integrate Zebra with your Pig scripts, see <a 
href="zebra_pig.html">Zebra and Pig</a>.</p>
  </section> 
 <!-- END ZEBRA INTEGRATION  -->
  
- 
- 
  </body>
  </document>
   

