Author: olga Date: Sat Sep 25 01:00:37 2010 New Revision: 1001116 URL: http://svn.apache.org/viewvc?rev=1001116&view=rev Log: PIG-1600: Docs update (chandec via olgan)
Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=1001116&r1=1001115&r2=1001116&view=diff ============================================================================== --- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml (original) +++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml Sat Sep 25 01:00:37 2010 @@ -163,22 +163,23 @@ C = foreach B generate group, MyUDF(A); <section> <title>Implement the Aggregator Interface</title> <p> -If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Aggregator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used on conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see the Pig UDF Manual and <a href="udf.html#Accumulator+Interface">Accumulator Interface</a>. +If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Aggregator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see the Pig UDF Manual and <a href="udf.html#Accumulator+Interface">Accumulator Interface</a>. 
</p> </section> <section> <title>Drop Nulls Before a Join</title> -<p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row, in a standard join the rows with a null key will always be dropped. The join: </p> +<p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row (and no output), in a standard join the rows with a null key will always be dropped. </p> +<p>This join</p> <source> A = load 'myfile' as (t, u, v); B = load 'myotherfile' as (x, y, z); C = join A by t, B by x; </source> -<p>is rewritten by pig to </p> +<p>is rewritten by Pig to </p> <source> A = load 'myfile' as (t, u, v); B = load 'myotherfile' as (x, y, z); @@ -186,8 +187,9 @@ C1 = cogroup A by t INNER, B by x INNER; C = foreach C1 generate flatten(A), flatten(B); </source> -<p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. 
If the query is rewritten to </p> +<p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. </p> +<p>If the query is rewritten to </p> <source> A = load 'myfile' as (t, u, v); B = load 'myotherfile' as (x, y, z); Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?rev=1001116&r1=1001115&r2=1001116&view=diff ============================================================================== --- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml (original) +++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml Sat Sep 25 01:00:37 2010 @@ -227,7 +227,7 @@ Also, be sure to review the information <row> <entry> <para>-- O </para> </entry> - <entry> <para>or, order, outer, output</para> </entry> + <entry> <para>onschema, or, order, outer, output</para> </entry> </row> <row> @@ -282,7 +282,7 @@ Also, be sure to review the information <!-- RELATIONS, BAGS, TUPLES, FIELDS--> - <section> + <section id ="relations"> <title>Relations, Bags, Tuples, Fields</title> <para><ulink url="piglatin_ref1.html#Pig+Latin+Statements">Pig Latin statements</ulink> work with relations. A relation can be defined as follows:</para> <itemizedlist> @@ -1117,7 +1117,8 @@ dump X; <section id="nulls_join"> <title>Nulls and JOIN Operator</title> - <para>The JOIN operator - when performing inner joins - adheres to the SQL standard and disregards (filters out) null values.</para> + <para>The JOIN operator - when performing inner joins - adheres to the SQL standard and disregards (filters out) null values. 
+ (See also <ulink url="cookbook.html#Drop+Nulls+Before+a+Join">Drop Nulls Before a Join</ulink>.)</para> <programlisting> A = load 'student' as (name:chararray, age:int, gpa:float); B = load 'student' as (name:chararray, age:int, gpa:float); @@ -1444,17 +1445,21 @@ A = LOAD 'data' AS (f1:int, f2:int); </programlisting> </section> - <section> + <section id="schemaforeach"> <title>Schemas with FOREACH Statements</title> <para>With FOREACH statements, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN operator is used. Otherwise, the schema should not be enclosed in parentheses.</para> <para>In this example the FOREACH statement includes FLATTEN and a schema for simple data types.</para> <programlisting> -X = FOREACH C GENERATE FLATTEN(B) AS (f1:int, f2:int, f3:int); +X = FOREACH C GENERATE FLATTEN(B) AS (f1:int, f2:int, f3:int), group; </programlisting> - <para>In this example the FOREACH statement includes a schema for simple data types.</para> + <para>In this example the FOREACH statement includes a schema for a simple expression.</para> <programlisting> X = FOREACH A GENERATE f1+f2 AS x1:int; </programlisting> + <para>In this example the FOREACH statement includes schemas for multiple fields.</para> +<programlisting> +X = FOREACH A GENERATE f1 as user, f2 as age, f3 as gpa; +</programlisting> </section> <section> @@ -4244,6 +4249,18 @@ B = FOREACH A GENERATE -x, y; For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).</para> + <para>Also note that flattening an empty bag will result in that row being discarded; no output is generated. 
+ (See also <ulink url="cookbook.html#Drop+Nulls+Before+a+Join">Drop Nulls Before a Join</ulink>.)</para> + + <programlisting> +grunt> cat empty.bag +{} 1 +grunt> A = LOAD 'empty.bag' AS (b : bag{}, i : int); +grunt> B = FOREACH A GENERATE flatten(b), i; +grunt> DUMP B; +grunt> +</programlisting> + <para>For examples using the FLATTEN operator, see <ulink url="#FOREACH">FOREACH</ulink>.</para> </section> @@ -4864,7 +4881,7 @@ dump E; <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>alias = CROSS alias, alias [, alias â¦] [PARALLEL n];</para> + <para>alias = CROSS alias, alias [, alias â¦] [PARTITION BY partitioner] [PARALLEL n];</para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -4872,7 +4889,8 @@ dump E; <section> <title>Terms</title> <informaltable frame="all"> - <tgroup cols="2"><tbody><row> + <tgroup cols="2"><tbody> + <row> <entry> <para>alias</para> </entry> @@ -4880,6 +4898,22 @@ dump E; <para>The name of a relation. </para> </entry> </row> + <row> + <entry> + <para>PARTITION BY partitioner</para> + </entry> + <entry> + <para>Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs. 
</para> + <itemizedlist> + <listitem> + <para>For more details, see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html</para> + </listitem> + <listitem> + <para>For usage, see <xref linkend="partitionby" /></para> + </listitem> + </itemizedlist> + </entry> + </row> <row> <entry> <para>PARALLEL n</para> @@ -4946,7 +4980,7 @@ DUMP X; <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>alias = DISTINCT alias [PARALLEL n];    </para> + <para>alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];    </para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -4962,6 +4996,24 @@ DUMP X; <para>The name of the relation.</para> </entry> </row> + + <row> + <entry> + <para>PARTITION BY partitioner</para> + </entry> + <entry> + <para>Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs. </para> + <itemizedlist> + <listitem> + <para>For more details, see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html</para> + </listitem> + <listitem> + <para>For usage, see <xref linkend="partitionby" />.</para> + </listitem> + </itemizedlist> + </entry> + </row> + <row> <entry> <para>PARALLEL n</para> @@ -4983,7 +5035,7 @@ DUMP X; <section> <title>Usage</title> - <para>Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents (to eliminate duplicates, Pig must first sort the data). You cannot use DISTINCT on a subset of fields. To do this, use FOREACH ⦠GENERATE to select the fields, and then use DISTINCT.</para></section> + <para>Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents (to eliminate duplicates, Pig must first sort the data). You cannot use DISTINCT on a subset of fields. 
To do this, use FOREACH…GENERATE to select the fields, and then use DISTINCT.</para></section> <section> <title>Example</title> @@ -5058,7 +5110,7 @@ DUMP X; <section> <title>Usage</title> - <para>Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH …GENERATE operation).</para> + <para>Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).</para> <para>FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.</para></section> <section> @@ -5108,7 +5160,7 @@ DUMP X; <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];</para> + <para>alias = FOREACH { gen_blk | nested_gen_blk };</para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -5129,9 +5181,10 @@ DUMP X; <para>gen_blk</para> </entry> <entry> - <para>FOREACH … GENERATE used with a relation (outer bag). Use this syntax:</para> + <para>FOREACH…GENERATE used with a relation (outer bag). Use this syntax:</para> <para/> - <para>alias = FOREACH alias GENERATE expression [expression …]</para> + <para>alias = FOREACH alias GENERATE expression [AS schema] [expression [AS schema] …];</para> + <para>See <xref linkend="schemaforeach"/></para> </entry> </row> <row> <entry> <para>nested_gen_blk</para> </entry> <entry> - <para>FOREACH … GENERATE used with a inner bag. Use this syntax:</para> + <para>FOREACH...GENERATE used with an inner bag. 
Use this syntax:</para> <para/> <para>alias = FOREACH nested_alias {</para> <para>  alias = nested_op; [alias = nested_op; …]</para> - <para>  GENERATE expression [, expression …]</para> + <para>  GENERATE expression [AS schema] [expression [AS schema] …]</para> <para>};</para> <para/> <para>Where:</para> <para>The nested block is enclosed in opening and closing brackets { … }. </para> <para>The GENERATE keyword must be the last statement within the nested block.</para> + <para>See <xref linkend="schemaforeach"/></para> </entry> </row> <row> @@ -5172,8 +5226,9 @@ DUMP X; <para>nested_op</para> </entry> <entry> - <para>Allowed operations are DISTINCT, FILTER, LIMIT, ORDER and SAMPLE. </para> - <para>The FOREACH … GENERATE operation itself is not allowed since this could lead to an arbitrary number of nesting levels.</para> + <para>Allowed operations are DISTINCT, FILTER, LIMIT, and ORDER. </para> + <para>The FOREACH…GENERATE operation itself is not allowed since this could lead to an arbitrary number of nesting levels.</para> + <para>You can also perform projections (see <xref linkend="nestedblock" />).</para> </entry> </row> <row> <entry> <para>AS</para> </entry> <entry> - <para>Keyword.</para> + <para>Keyword</para> </entry> </row> <row> @@ -5204,9 +5259,9 @@ DUMP X; <section> <title>Usage</title> - <para>Use the FOREACH …GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).</para> + <para>Use the FOREACH…GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).</para> - <para>FOREACH …GENERATE works with relations (outer bags) as well as inner bags:</para> + <para>FOREACH...GENERATE works with relations (outer bags) as well as inner bags:</para> <itemizedlist> <listitem> <para>If A is a relation (outer bag), a FOREACH statement could look like this.</para> @@ -5404,21 +5459,21 @@ DUMP X; <para>Another 
FLATTEN example. Here, relations A and B both have a column x. When forming relation E, you need to use the :: operator to identify which column x to use - either relation A column x (A::x) or relation B column x (B::x). This example uses relation A column x (A::x).</para> <programlisting> -A = load 'data' as (x, y); -B = load 'data' as (x, z); -C = cogroup A by x, B by x; -D = foreach C generate flatten(A), flatten(b); -E = group D by A::x; +A = LOAD 'data' AS (x, y); +B = LOAD 'data' AS (x, z); +C = COGROUP A BY x, B BY x; +D = FOREACH C GENERATE flatten(A), flatten(B); +E = GROUP D BY A::x; …… </programlisting> </section> - <section> + <section id="nestedblock"> <title>Example: Nested Block</title> <para>Suppose we have relations A and B. Note that relation B contains an inner bag.</para> <programlisting> -A = LOAD 'data' AS (url:chararray,outline:chararray); +A = LOAD 'data' AS (url:chararray,outlink:chararray); DUMP A; (www.ccc.com,www.hjk.com) @@ -5437,21 +5492,23 @@ DUMP B; (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)}) </programlisting> - <para>In this example we perform two of the operations allowed in a nested block, FILTER and DISTINCT. Note that the last statement in the nested block must be GENERATE. 
Also, note the use of projection (PA = FA.outlink;).</para> <programlisting> -X = foreach B { +X = FOREACH B { FA= FILTER A BY outlink == 'www.xyz.org'; PA = FA.outlink; DA = DISTINCT PA; - GENERATE GROUP, COUNT(DA); + GENERATE group, COUNT(DA); } DUMP X; -(www.ddd.com,1L) -(www.www.com,1L) +(www.aaa.com,0) +(www.ccc.com,0) +(www.ddd.com,1) +(www.www.com,1) </programlisting> - </section></section> +</section></section> <section id="GROUP"> <title>GROUP</title> @@ -5464,7 +5521,7 @@ DUMP X; <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression â¦] [USING 'collected' | 'merge'] [PARALLEL n];</para> + <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression â¦] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];</para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -5576,6 +5633,23 @@ DUMP X; </entry> </row> + + <row> + <entry> + <para>PARTITION BY partitioner</para> + </entry> + <entry> + <para>Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs. 
</para> + <itemizedlist> + <listitem> + <para>For more details, see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html</para> + </listitem> + <listitem> + <para>For usage, see <xref linkend="partitionby" /></para> + </listitem> + </itemizedlist> + </entry> + </row> <row> <entry> @@ -5809,6 +5883,30 @@ DUMP F; C = COGROUP A BY id, B BY id USING 'merge'; </programlisting> </section> + + <section id="partitionby"> + <title>Example: PARTITION BY</title> +<para>To use the Hadoop Partitioner add PARTITION BY clause to the appropriate operator: </para> +<programlisting> +A = LOAD 'input_data'; +B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2; +</programlisting> +<para>Here is the code for SimpleCustomPartitioner:</para> +<programlisting> +public class SimpleCustomPartitioner extends Partitioner <PigNullableWritable, Writable> { + //@Override + public int getPartition(PigNullableWritable key, Writable value, int numPartitions) { + if(key.getValueAsPigType() instanceof Integer) { + int ret = (((Integer)key.getValueAsPigType()).intValue() % numPartitions); + return ret; + } + else { + return (key.hashCode()) % numPartitions; + } + } +} +</programlisting> + </section> </section> @@ -5823,7 +5921,7 @@ DUMP F; <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>alias = JOIN alias BY {expression|'('expression [, expression â¦]')'} (, alias BY {expression|'('expression [, expression â¦]')'} â¦) [USING 'replicated' | 'skewed' | 'merge'] [PARALLEL n]; </para> + <para>alias = JOIN alias BY {expression|'('expression [, expression â¦]')'} (, alias BY {expression|'('expression [, expression â¦]')'} â¦) [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n]; </para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -5891,6 +5989,25 @@ DUMP F; </entry> </row> + <row> + <entry> + <para>PARTITION BY partitioner</para> + </entry> + <entry> + 
<para>Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs. </para> + <itemizedlist> + <listitem> + <para>For more details, see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html</para> + </listitem> + <listitem> + <para>For usage, see <xref linkend="partitionby" /></para> + </listitem> + </itemizedlist> + <para></para> + <para>This feature CANNOT be used with skewed joins.</para> + </entry> + </row> + <row> <entry> @@ -5982,7 +6099,7 @@ DUMP X; <tgroup cols="1"><tbody><row> <entry> <para>alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column - [USING 'replicated' | 'skewed' | 'merge'] [PARALLEL n]; </para> + [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n]; </para> </entry> </row></tbody></tgroup> </informaltable> @@ -6081,8 +6198,7 @@ DUMP X; </entry> </row> - - <row> + <row> <entry> <para>'merge'</para> </entry> @@ -6090,6 +6206,25 @@ DUMP X; <para>Use to perform merge joins (see <ulink url="piglatin_ref1.html#Merge+Joins">Merge Joins</ulink>).</para> </entry> </row> + + <row> + <entry> + <para>PARTITION BY partitioner</para> + </entry> + <entry> + <para>Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs. 
</para> + <itemizedlist> + <listitem> + <para>For more details, see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html</para> + </listitem> + <listitem> + <para>For usage, see <xref linkend="partitionby" /></para> + </listitem> + </itemizedlist> + <para></para> + <para>This feature CANNOT be used with skewed joins.</para> + </entry> + </row> <row> <entry> @@ -6384,8 +6519,96 @@ ILLUSTRATE A; For examples of how to specify more complex schemas for use with the LOAD operator, see Schemas for Complex Data Types and Schemas for Multiple Types. </para></section></section> + +<section> + <title>MAPREDUCE</title> + <para>Executes native MapReduce jobs inside a Pig script.</para> + + <section> + <title>Syntax</title> + <informaltable frame="all"> + <tgroup cols="1"><tbody><row> + <entry> + <para>alias1 = MAPREDUCE 'mr1.jar' [('mr2.jar', ...)] STORE alias2 INTO +'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `];</para> + </entry> + </row></tbody></tgroup> + </informaltable> + </section> + + <section> + <title>Terms</title> + <informaltable frame="all"> + <tgroup cols="2"><tbody> + <row> + <entry> + <para>alias</para> + </entry> + <entry> + <para>The name of a relation.</para> + </entry> + </row> + <row> + <entry> + <para>mr.jar</para> + </entry> + <entry> + <para>Any MapReduce jar file which can be run through "hadoop jar mymr.jar params" command. Thus, the contract for inputLocation and outputLocation is typically managed through params. </para> + </entry> + </row> + + <row> + <entry> + <para>STORE ... INTO ... USING</para> + </entry> + <entry> + <para>See <ulink url="piglatin_ref2.html#STORE">STORE</ulink></para> + <para>Store alias2 into the inputLocation using storeFunc, which is then used by native mapreduce to read its data.</para> + + </entry> + </row> + + <row> + <entry> + <para>LOAD ... USING ... 
AS ....</para> + </entry> + <entry> + <para>See <ulink url="piglatin_ref2.html#LOAD">LOAD</ulink></para> + <para>After running mr1.jar's mapreduce, load back the data from outputLocation into alias1 using loadFunc as schema.</para> + </entry> + </row> + + <row> + <entry> + <para>'params, ...'</para> + </entry> + <entry> + <para>Extra parameters required for the native MapReduce job. </para> + </entry> + </row> + </tbody></tgroup> +</informaltable> +</section> + +<section> +<title>Usage</title> +<para>Use the MAPREDUCE operator to run native MapReduce jobs from inside a Pig script.</para> +</section> + +<section> +<title>Example</title> +<para>This example shows how to run the wordcount MapReduce program from Pig. +Note that the files specified as input and output locations in the MAPREDUCE statement will NOT be deleted by Pig automatically. You will need to delete them manually. </para> +<programlisting> +A = LOAD 'WordcountInput.txt'; +B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; +</programlisting> +</section> + +</section> + <section> - <title>ORDER</title> + <title>ORDER BY</title> <para>Sorts a relation based on one or more fields.</para> <section> @@ -6411,18 +6634,18 @@ ILLUSTRATE A; </row> <row> <entry> - <para>BY</para> + <para>*</para> </entry> <entry> - <para>Required keyword.</para> + <para>The designator for a tuple.</para> </entry> </row> - <row> + <row> <entry> - <para>*</para> + <para>field_alias</para> </entry> <entry> - <para>The designator for a tuple.</para> + <para>A field in the relation. 
The field must be a simple type.</para> </entry> </row> <row> @@ -6441,14 +6664,7 @@ ILLUSTRATE A; <para>Sort in descending order.</para> </entry> </row> - <row> - <entry> - <para>field_alias</para> - </entry> - <entry> - <para>A field in the relation.</para> - </entry> - </row> + <row> <entry> <para>PARALLEL n</para> @@ -6470,19 +6686,28 @@ ILLUSTRATE A; <section> <title>Usage</title> - <para>In Pig, relations are unordered (see Relations, Bags, Tuples, and Fields):</para> + <para>In Pig, relations are unordered (see <xref linkend="relations" />):</para> <itemizedlist> <listitem> - <para>If you order relation A to produce relation X (X = ORDER A BY * DESC;) relations A and X still contain the same thing. </para> + <para>If you order relation A to produce relation X (X = ORDER A BY * DESC;) relations A and X still contain the same data. </para> </listitem> <listitem> - <para>If you retrieve the contents of relation X (DUMP X;) they are guaranteed to be in the order you specified (descending).</para> + <para>If you retrieve relation X (DUMP X;) the data is guaranteed to be in the order you specified (descending).</para> </listitem> <listitem> - <para>However, if you further process relation X (Y = FILTER X BY $0 > 1;) there is no guarantee that the contents will be processed in the order you originally specified (descending).</para> + <para>However, if you further process relation X (Y = FILTER X BY $0 > 1;) there is no guarantee that the data will be processed in the order you originally specified (descending).</para> </listitem> - </itemizedlist></section> - + </itemizedlist> + <para></para> + <para>Pig currently supports ordering on fields with simple types or by tuple designator (*). You cannot order on fields with complex types or by expressions. 
</para> <programlisting> +A = LOAD 'mydata' AS (x: int, y: map[]); +B = ORDER A BY x; -- this is allowed because x is a simple type +B = ORDER A BY y; -- this is not allowed because y is a complex type +B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression +</programlisting> + </section> + <section> <title>Examples</title> <para>Suppose we have relation A.</para> @@ -6980,7 +7205,7 @@ X = STREAM A THROUGH 'stream.pl' as (f1: <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>alias = UNION alias, alias [, alias …];</para> + <para>alias = UNION [ONSCHEMA] alias, alias [, alias …];</para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -6988,14 +7213,35 @@ X = STREAM A THROUGH 'stream.pl' as (f1: <section> <title>Terms</title> <informaltable frame="all"> - <tgroup cols="2"><tbody><row> + <tgroup cols="2"><tbody> + <row> <entry> <para>alias</para> </entry> <entry> <para>The name of a relation.</para> </entry> - </row></tbody></tgroup> + </row> + + <row> + <entry> + <para>ONSCHEMA </para> + </entry> + <entry> + <para>Use the keyword ONSCHEMA with UNION so that the union is based on the column names of the input relations, and not on column position. +If the following requirements are not met, the statement will throw an error:</para> + <itemizedlist> + <listitem> + <para>All inputs to the union must have a non-null schema. </para> + </listitem> + <listitem> + <para>The data types of columns with the same name in different input schemas must be compatible. Numeric types are compatible; if a column with the same name has different numeric types in different input schemas, an implicit conversion will happen. The bytearray type is considered compatible with all other types; a cast will be added to convert it to the other type. Bags or tuples having different inner schemas are considered incompatible. 
+</para> + </listitem> + </itemizedlist> + </entry> + </row> + </tbody></tgroup> </informaltable></section> <section> @@ -7039,7 +7285,37 @@ DUMP X; (8,9) (1,3) </programlisting> - </section></section></section> + </section> + + <section> + <title>Example</title> + <para>This example shows the use of ONSCHEMA.</para> +<programlisting> +L1 = LOAD 'f1' AS (a : int, b : float); +DUMP L1; +(11,12.0) +(21,22.0) + +L2 = LOAD 'f2' AS (a : long, c : chararray); +DUMP L2; +(11,a) +(12,b) +(13,c) + +U = UNION ONSCHEMA L1, L2; +DESCRIBE U; +U : {a : long, b : float, c : chararray} + +DUMP U; +(11,12.0,) +(21,22.0,) +(11,,a) +(12,,b) +(13,,c) +</programlisting> +</section> + + </section></section> <!-- =========================================================================== --> @@ -8085,7 +8361,9 @@ DUMP X; COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.</para> <para> - The COUNT function ignores NULL values. If you want to include NULL values in the count computation, use + The COUNT function follows SQL semantics and ignores nulls. + What this means is that a tuple in the bag will not be counted if the <emphasis>first field</emphasis> in this tuple is NULL. + If you want to include NULL values in the count computation, use <ulink url="#COUNT_STAR">COUNT_STAR</ulink>. 
</para> @@ -8097,7 +8375,7 @@ DUMP X; <section> <title>Example</title> - <para>In this example the tuples in the bag are counted (see the GROUP operator for information about the field names in relation B).</para> + <para>In this example the tuples in the bag are counted (see the <ulink url="#GROUP">GROUP</ulink> operator for information about the field names in relation B).</para> <programlisting> A = LOAD 'data' AS (f1:int,f2:int,f3:int); @@ -11543,7 +11821,7 @@ grunt> rmf students students_sav The fs command greatly extends the set of supported file system commands and the capabilities supported for existing commands such as ls that will now support globbing. For a complete list of FSShell commands, see - <ulink url="http://hadoop.apache.org/common/docs/current/hdfs_shell.html">HDFS File System Shell Guide</ulink></para> + <ulink url="http://hadoop.apache.org/common/docs/current/file_system_shell.html">File System Shell Guide</ulink></para> </section> <section> @@ -11555,6 +11833,70 @@ fs -copyFromLocal file-x file-y fs -ls file-y </programlisting> </section> + </section> + <section> + <title>sh</title> + <para>Invokes any sh shell command from within a Pig script or the Grunt shell.</para> + + <section> + <title>Syntax </title> + <informaltable frame="all"> + <tgroup cols="1"><tbody><row> + <entry> + <para>sh subcommand subcommand_parameters </para> + </entry> + </row></tbody></tgroup> + </informaltable></section> + + <section> + <title>Terms</title> + <informaltable frame="all"> + <tgroup cols="2"> + <tbody> + <row> + <entry> + <para>subcommand</para> + </entry> + <entry> + <para>The sh shell command.</para> + </entry> + </row> + <row> + <entry> + <para>subcommand_parameters</para> + </entry> + <entry> + <para>The sh shell command parameters.</para> + </entry> + </row> + </tbody> + </tgroup> + </informaltable> + + </section> + + <section> + <title>Usage</title> + <para>Use the sh command to invoke any sh shell command from within a Pig script or Grunt 
shell.</para> + +<para> + Note that only real programs can be run from the sh command. Commands such as cd are not programs + but part of the shell environment and as such cannot be executed unless the user invokes the shell explicitly, like "bash cd". +</para> + </section> + + <section> + <title>Example</title> + <para>In this example the ls command is invoked.</para> +<programlisting> +grunt> sh ls +bigdata.conf +nightly.conf +..... +grunt> +</programlisting> + </section> + </section> </section> @@ -12000,12 +12342,10 @@ C = FOREACH B GENERATE group, COUNT(A.t) D = ORDER C BY mycount; STORE D INTO 'mysortedcount' USING PigStorage(); </programlisting> +</section> +</section> -</section></section> - - </section> - </article> Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml?rev=1001116&r1=1001115&r2=1001116&view=diff ============================================================================== --- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml (original) +++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Sat Sep 25 01:00:37 2010 @@ -143,7 +143,7 @@ DUMP C; <p>An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions <code>algebraic</code>. <code>COUNT</code> is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer. </p> -<p>It is very important for performance to make sure that aggregate functions that are algebraic are implemented as such. 
Let's look at the implementation of the COUNT function to see what this means. (Error handling and some other code is omitted to save space. The full code can be accessed <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup"> here</a>.</p> +<p>It is very important for performance to make sure that aggregate functions that are algebraic are implemented as such. Let's look at the implementation of the COUNT function to see what this means. (Error handling and some other code is omitted to save space. The full code can be accessed <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup"> here</a>.)</p> <source> public class COUNT extends EvalFunc<Long> implements Algebraic{ @@ -1231,6 +1231,16 @@ public class IntMax extends EvalFunc< <title>Advanced Topics</title> <section> +<title>UDF Interfaces</title> +<p>A UDF can be invoked in multiple ways. The simplest UDF can just extend EvalFunc, which requires only the exec function to be implemented (see <a href="#How+to+Write+a+Simple+Eval+Function"> How to Write a Simple Eval Function</a>). Every eval UDF must implement this. Additionally, if a function is algebraic, it can implement the <code>Algebraic</code> interface to significantly improve query performance in the cases when the combiner can be used (see <a href="#Aggregate+Functions">Aggregate Functions</a>). Finally, a function that can process tuples in an incremental fashion can also implement the Accumulator interface to improve query memory consumption (see <a href="#Accumulator+Interface">Accumulator Interface</a>). +</p> + +<p>The exact method by which a UDF is invoked is selected by the optimizer based on the UDF type and the query. Note that only a single interface is used at any given time. The optimizer tries to find the most efficient way to execute the function.
If a combiner is used and the function implements the Algebraic interface, then this interface will be used to invoke the function. If the combiner is not invoked but the accumulator can be used and the function implements the Accumulator interface, then that interface is used. If neither of these conditions is satisfied, then the exec function is used to invoke the UDF. +</p> + </section> + + +<section> <title>Function Instantiation</title> <p>One problem that users run into is when they make assumption about how many times a constructor for their UDF is called. For instance, they might be creating side files in the store function and doing it in the constructor seems like a good idea. The problem with this approach is that in most cases Pig instantiates functions on the client side to, for instance, examine the schema of the data. </p> @@ -1241,7 +1251,7 @@ public class IntMax extends EvalFunc< <section> <title>Schemas</title> -<p>One request from users is to have the ability to examine the input schema of the data before processing the data. For example, they would like to know how to convert an input tuple to a map such that the keys in the map are the names of the input columns. The current answer is that there is now way to do this. This is something we would like to support in the future. </p> +<p>One request from users is to have the ability to examine the input schema of the data before processing the data. For example, they would like to know how to convert an input tuple to a map such that the keys in the map are the names of the input columns. The current answer is that there is no way to do this. This is something we would like to support in the future. </p> </section> @@ -1298,6 +1308,131 @@ public class IntMax extends EvalFunc< </section> </section> +<section> +<title>Python UDFs</title> +<section> +<title>Syntax</title> +<section> +<title>Registering Scripts</title> +<p>You can register a Python script as shown here.
This example uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the Python script. You can develop and use custom script engines to support multiple programming languages and ways to interpret them. Currently, Pig identifies jython as a keyword and ships the required scriptengine (jython) to interpret it. </p> +<source> +Register 'test.py' using jython as myfuncs; +</source> + +<p>The following syntax is also supported, where myfuncs is the namespace created for all the functions inside test.py.</p> +<source> +register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs; +</source> + +<p>A typical test.py looks like this:</p> +<source> +@outputSchema("x:{t:(word:chararray)}") +def helloworld(): + return ('Hello, World') + +@outputSchema("y:{t:(word:chararray,num:long)}") +def complex(word): + return (str(word),long(word)*long(word)) + +@outputSchemaFunction("squareSchema") +def square(num): + return ((num)*(num)) + +@schemaFunction("squareSchema") +def squareSchema(input): + return input + +# No decorator - bytearray +def concat(str): + return str+str +</source> + +<p>The register statement above registers the Python functions defined in test.py in Pig's runtime within the defined namespace (myfuncs here). They can then be referred to later in the Pig script as myfuncs.helloworld(), myfuncs.complex(), and myfuncs.square(). An example usage is:</p> + +<source> +b = foreach a generate myfuncs.helloworld(), myfuncs.square(3); +</source> + +</section> +<section> +<title>Decorators and Schemas</title> +<p>To annotate a Python script so that Pig can identify return types, use Python decorators to define the output schema for the script UDF. </p> +<ul> +<li>outputSchema - Defines the schema for a script UDF in a format that Pig understands and is able to parse. </li> +<li>outputSchemaFunction - Defines a script delegate function that defines the schema for this function depending upon the input type.
This is needed for functions that can accept generic types and perform generic operations on these types. A simple example is square, which can accept multiple types. The schema function in this case is a simple identity function (same schema as input). </li> +<li>schemaFunction - Defines a delegate function; it is not registered to Pig. </li> +</ul> + +<p>When no decorator is specified, Pig assumes the output datatype to be bytearray and converts the output generated by the script function to bytearray. This is consistent with Pig's behavior in the case of Java UDFs. </p> + +<p>Sample schema string - y:{t:(word:chararray,num:long)}. Variable names inside a schema string are not used anywhere; they just make the syntax identifiable to the parser. </p> +</section> +</section> + +<section> +<title>Inline Scripts</title> +<p>Currently, Pig does not support UDFs using inline scripts. </p> +</section> + +<section> +<title>Sample Script UDFs</title> +<p>Simple tasks like string manipulation, mathematical computations, and reorganizing data types can be easily done using Python scripts without having to develop long and complex UDFs in Java. The overall overhead of using a scripting language is much lower, and the development cost is almost negligible. The following UDFs, developed in Python, can be used with Pig.
</p> + +<source> +mySampleLib.py +--------------------- +#!/usr/bin/python + +################## +# Math functions # +################## +#Square - Square of a number of any data type +@outputSchemaFunction("squareSchema") +def square(num): + return ((num)*(num)) +@schemaFunction("squareSchema") +def squareSchema(input): + return input + +#Percent - Percentage +@outputSchema("t:(percent:double)") +def percent(num, total): + return num * 100 / total + +#################### +# String Functions # +#################### +#commaFormat - Format a number with commas, 12345 -> 12,345 +@outputSchema("t:(numformat:chararray)") +def commaFormat(num): + return '{:,}'.format(num) + +#concatMult4 - Concatenate multiple words +@outputSchema("t:(numformat:chararray)") +def concatMult4(word1, word2, word3, word4): + return word1+word2+word3+word4 + +####################### +# Data Type Functions # +####################### +#collectBag - Collect elements of a bag into another bag +#This is a useful UDF after a group operation +@outputSchema("y:{t:(len:int,word:chararray)}") +def collectBag(bag): + outBag = [] + for word in bag: + tup=(len(bag), word[1]) + outBag.append(tup) + return outBag + +# A few comments: +# Pig mandates that a bag should be a bag of tuples; Python UDFs should follow this pattern. +# Tuples in Python are immutable, so appending to a tuple is not possible. +</source> +</section> + +</section> + </body> </document>
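Because the script functions in the sample library are plain Python functions, they can be sanity-checked with an ordinary Python interpreter before being registered with Pig. The following is a minimal sketch, not part of the patch: the decorators are omitted (they are only meaningful once Pig's Jython engine loads the script), and the bag passed to collectBag is a hypothetical list of tuples standing in for a Pig bag.

```python
# Plain-Python versions of the sample UDF bodies, for quick local testing.

def square(num):
    return num * num

def commaFormat(num):
    # Formats 12345 as '12,345'; the ',' format spec needs Python 2.6+.
    return '{:,}'.format(num)

def concatMult4(word1, word2, word3, word4):
    return word1 + word2 + word3 + word4

def collectBag(bag):
    # Pig bags are collections of tuples; a new tuple is built per element,
    # since Python tuples are immutable and cannot be appended to.
    outBag = []
    for word in bag:
        outBag.append((len(bag), word[1]))
    return outBag

print(square(3))                                  # 9
print(commaFormat(12345))                         # 12,345
print(concatMult4('a', 'b', 'c', 'd'))            # abcd
print(collectBag([(1, 'hello'), (2, 'world')]))   # [(2, 'hello'), (2, 'world')]
```

Once the functions behave as expected, the script can be registered as shown earlier (Register 'test.py' using jython as myfuncs;) and the functions invoked from Pig Latin, for example myfuncs.square(3).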