Author: olga Date: Thu Apr 22 19:35:03 2010 New Revision: 937032 URL: http://svn.apache.org/viewvc?rev=937032&view=rev Log: PIG-1320: final documentation updates for Pig 0.7.0 (chandec via olgan)
Modified: hadoop/pig/branches/branch-0.7/CHANGES.txt hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml Modified: hadoop/pig/branches/branch-0.7/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/CHANGES.txt?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/CHANGES.txt (original) +++ hadoop/pig/branches/branch-0.7/CHANGES.txt Thu Apr 22 19:35:03 2010 @@ -68,6 +68,8 @@ manner (rding via pradeepkth) IMPROVEMENTS +PIG-1320: final documentation updates for Pig 0.7.0 (chandec via olgan) + PIG-1330: Move pruned schema tracking logic from LoadFunc to core code (daijy) PIG-1320: more documentation updates for Pig 0.7.0 (chandec via olgan) Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml (original) +++ 
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml Thu Apr 22 19:35:03 2010 @@ -25,7 +25,6 @@ </header> <body> - <!-- ABOUT PIG LATIN --> <section> <title>Overview</title> @@ -55,25 +54,46 @@ <section> <title>Running Pig Latin </title> - <p>You can execute Pig Latin statements interactively or in batch mode using Pig scripts (see the <a href="piglatin_ref2.html#exec">exec</a> and <a href="piglatin_ref2.html#run">run</a> commands).</p> - <p>Grunt Shell, Interactive or Batch Mode</p> + <p>You can execute Pig Latin statements: </p> + <ul> + <li>Using grunt shell or command line</li> + <li>In mapreduce mode or local mode</li> + <li>Either interactively or in batch </li> + </ul> + + + <p></p> +<p>Note that Pig now uses Hadoop's local mode (rather than Pig's native local mode).</p> +<p>A few run examples are shown here; see <a href="setup.html">Pig Setup</a> for more examples.</p> + + + <p>Grunt Shell - interactive, mapreduce mode (because mapreduce mode is the default, you do not need to specify it)</p> <source> $ pig ... - Connecting to ... grunt> A = load 'data'; grunt> B = ...
; -or +</source> + + <p>Grunt Shell - batch, local mode (see the <a href="piglatin_ref2.html#exec">exec</a> and <a href="piglatin_ref2.html#run">run</a> commands)</p> + <source> +$ pig -x local grunt> exec myscript.pig; or grunt> run myscript.pig; -</source> +</source> -<p>Command Line, Batch Mode</p> +<p>Command Line - batch, mapreduce mode</p> <source> $ pig myscript.pig </source> +<p>Command Line - batch, local mode</p> + <source> +$ pig -x local myscript.pig +</source> + <p></p> <p><em>In general</em>, Pig processes Pig Latin statements as follows:</p> <ol> @@ -105,7 +125,9 @@ DUMP B; </source> <p> </p> - <p>Note: See Multi-Query Execution for more information on how Pig Latin statements are processed.</p> + <p>See <a href="#Multi-Query+Execution">Multi-Query Execution</a> for more information on how Pig Latin statements are processed.</p> + + </section> <section> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml Thu Apr 22 19:35:03 2010 @@ -5457,14 +5457,16 @@ readability, programmers usually use GRO <para>The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group.
This tuple contains two fields: </para> <itemizedlist> <listitem> - <para>The first field is named "group" (do not confuse this with the GROUP operator) and is the same type of the group key.</para> + <para>The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.</para> </listitem> <listitem> <para>The second field takes the name of the original relation and is type bag.</para> - <para/> + </listitem> + <listitem> <para>The names of both fields are generated by the system as shown in the example below.</para> </listitem> </itemizedlist> + <para></para> <para> Note that the GROUP (and thus COGROUP) and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples. </para> @@ -6642,7 +6644,7 @@ a:8,b:4,c:3 <para>cmd_alias</para> </entry> <entry> - <para>The name of a command created using the <ulink url="#DEFINE">DEFINE</ulink> operator.</para> + <para>The name of a command created using the <ulink url="#DEFINE">DEFINE</ulink> operator (see the DEFINE operator for additional streaming examples).</para> </entry> </row> <row> @@ -6672,13 +6674,13 @@ A = LOAD 'data'; B = STREAM A THROUGH 'stream.pl -n 5'; </programlisting> - <para>When used with a cmd_alias, a stream statement could look like this, where cmd is the defined alias.</para> + <para>When used with a cmd_alias, a stream statement could look like this, where mycmd is the defined alias.</para> <programlisting> A = LOAD 'data'; -DEFINE cmd 'stream.pl -n 5'; +DEFINE mycmd 'stream.pl -n 5'; -B = STREAM A THROUGH cmd; +B = STREAM A THROUGH mycmd; </programlisting> </section> @@ -6741,10 +6743,7 @@ E = STREAM C THROUGH 'stream.pl'; X = STREAM A THROUGH 'stream.pl' as (f1:int, f2:int, f3:int); </programlisting> </section> - - <section> - <title>Additional Examples</title> - <para>See the UDF statement DEFINE for additional examples.</para></section></section> + </section> <section>
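The GROUP output shape documented in this hunk (a first field named "group" typed like the key, and a second field that is a bag named after the input relation, with one output tuple per group) can be sketched outside Pig. The following is an illustrative Python model of those semantics only, not Pig code; the helper name and sample relation are invented:

```python
# Illustrative model of GROUP semantics (not Pig code): one output tuple per
# group, first field is the group key, second field is the "bag" of original
# tuples that share that key.
from collections import OrderedDict

def group_relation(tuples, key_index):
    """Group a list of tuples by the field at key_index, preserving first-seen key order."""
    groups = OrderedDict()
    for t in tuples:
        groups.setdefault(t[key_index], []).append(t)
    return [(key, bag) for key, bag in groups.items()]

# Roughly: A = LOAD 'data' AS (name, age);  B = GROUP A BY name;
A = [("john", 18), ("mary", 19), ("john", 21)]
B = group_relation(A, 0)
print(B)  # [('john', [('john', 18), ('john', 21)]), ('mary', [('mary', 19)])]
```

As in Pig, grouping by a single field yields a key of that field's type; a multi-field group key would correspond to a tuple of those fields.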
<title>UNION</title> @@ -7120,8 +7119,9 @@ Local Rearrange[tuple]{chararray}(false) <section> <title>ILLUSTRATE</title> + <para>(Note! This feature is NOT maintained at the moment. We are looking for someone to adopt it.)</para> <para>Displays a step-by-step execution of a sequence of statements.</para> - + <section> <title>Syntax</title> <informaltable frame="all"> @@ -7248,14 +7248,14 @@ ILLUSTRATE num_user_visits; <section> <title>DEFINE</title> - <para>Assigns an alias to a function or command.</para> + <para>Assigns an alias to a UDF function or a streaming command.</para> <section> <title>Syntax</title> <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>DEFINE alias {function | ['command' [input] [output] [ship] [cache]] };</para> + <para>DEFINE alias {function | [`command` [input] [output] [ship] [cache]] };</para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -7268,7 +7268,7 @@ ILLUSTRATE num_user_visits; <para>alias</para> </entry> <entry> - <para>The name for the function or command.</para> + <para>The name for a UDF function or the name for a streaming command (the cmd_alias for the <ulink url="#STREAM">STREAM</ulink> operator). </para> </entry> </row> <row> @@ -7276,14 +7276,16 @@ ILLUSTRATE num_user_visits; <para>function</para> </entry> <entry> - <para>The name of a function.</para> + <para>For use with functions.</para> + <para>The name of a UDF function. 
</para> </entry> </row> <row> <entry> - <para>`command `</para> + <para>`command`</para> </entry> <entry> + <para>For use with streaming.</para> <para>A command, including the arguments, enclosed in back tics (where a command is anything that can be executed).</para> </entry> </row> @@ -7292,6 +7294,7 @@ ILLUSTRATE num_user_visits; <para>input</para> </entry> <entry> + <para>For use with streaming.</para> <para>INPUT ( {stdin | 'path'} [USING serializer] [, {stdin | 'path'} [USING serializer] …] )</para> <para>Where:</para> <itemizedlist> @@ -7305,7 +7308,7 @@ ILLUSTRATE num_user_visits; <para>USING – Keyword.</para> </listitem> <listitem> - <para>serializer – PigStream is the default serializer. </para> + <para>serializer – PigStreaming is the default serializer. </para> </listitem> </itemizedlist> </entry> @@ -7315,6 +7318,7 @@ ILLUSTRATE num_user_visits; <para>output</para> </entry> <entry> + <para>For use with streaming.</para> <para>OUTPUT ( {stdout | stderr | 'path'} [USING deserializer] [, {stdout | stderr | 'path'} [USING deserializer] …] )</para> <para>Where:</para> <itemizedlist> @@ -7328,7 +7332,7 @@ ILLUSTRATE num_user_visits; <para>USING – Keyword.</para> </listitem> <listitem> - <para>deserializer – PigStream is the default deserializer. </para> + <para>deserializer – PigStreaming is the default deserializer.
</para> </listitem> </itemizedlist> </entry> @@ -7338,6 +7342,7 @@ ILLUSTRATE num_user_visits; <para>ship</para> </entry> <entry> + <para>For use with streaming.</para> <para>SHIP('path' [, 'path' …])</para> <para>Where:</para> <itemizedlist> @@ -7355,6 +7360,7 @@ ILLUSTRATE num_user_visits; <para>cache</para> </entry> <entry> + <para>For use with streaming.</para> <para>CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file' …])</para> <para>Where:</para> <itemizedlist> @@ -7371,8 +7377,8 @@ ILLUSTRATE num_user_visits; <section> <title>Usage</title> - <para>Use the DEFINE statement to assign a name (alias) to a function or to a command.</para> - <para>Use DEFINE to specify a function when:</para> + <para>Use the DEFINE statement to assign a name (alias) to a UDF function or to a streaming command.</para> + <para>Use DEFINE to specify a UDF function when:</para> <itemizedlist> <listitem> <para>The function has a long package name that you don't want to include in a script, especially if you call the function several times in that script.</para> @@ -7381,10 +7387,19 @@ ILLUSTRATE num_user_visits; <para>The constructor for the function takes string parameters.
If you need to use different constructor parameters for different calls to the function you will need to create multiple defines – one for each parameter set.</para> </listitem> </itemizedlist> - <para>Use DEFINE to specify a command when the <ulink url="#STREAM">streaming</ulink> command specification is complex or requires additional parameters (input, output, and so on).</para> + <para>Use DEFINE to specify a streaming command when: </para> + <itemizedlist> + <listitem> + <para>The streaming command specification is complex.</para> + </listitem> + <listitem> + <para>The streaming command specification requires additional parameters (input, output, and so on).</para> + </listitem> + </itemizedlist> - <section - ><title>About Input and Output</title> + + <section> + <title>About Input and Output</title> <para>Serialization is needed to convert data from tuples to a format that can be processed by the streaming application. Deserialization is needed to convert the output from the streaming application back into tuples. PigStreaming is the default serialization/deserialization function.</para> <para>Streaming uses the same default format as PigStorage to serialize/deserialize the data. If you want to explicitly specify a format, you can do it as shown below (see more examples in the Examples: Input/Output section). </para> @@ -7534,6 +7549,19 @@ X = STREAM A THROUGH Y; </programlisting> </section> + </section> + <section> + <title>Example: DEFINE with STREAM</title> +<para>In this example a command is defined for use with the <ulink url="#STREAM">STREAM</ulink> operator.</para> +<programlisting> +A = LOAD 'data'; + +DEFINE mycmd 'stream_cmd -input file.dat'; + +B = STREAM A through mycmd; +</programlisting> +</section> + <section> <title>Examples: Logging</title> <para>In this example the streaming stderr is stored in the _logs/<dir> directory of the job's output directory.
Because the job can have multiple streaming applications associated with it, you need to ensure that different directory names are used to avoid conflicts. Pig stores up to 100 tasks per streaming job.</para> @@ -7554,16 +7582,8 @@ A = LOAD 'students'; B = FOREACH A GENERATE myFunc($0); </programlisting> - -<para>In this example a command is defined for use with the STREAM operator.</para> -<programlisting> -A = LOAD 'data'; -DEFINE cmd 'stream_cmd -input file.dat'; -B = STREAM A through cmd; -</programlisting> -</section> </section> <section> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml Thu Apr 22 19:35:03 2010 @@ -28,22 +28,12 @@ <title>Requirements</title> <p><strong>Unix</strong> and <strong>Windows</strong> users need the following:</p> <ol> - <li> <strong>Hadoop 20</strong> - <a href="http://hadoop.apache.org/core/">http://hadoop.apache.org/core/</a></li> - <li> <strong>Java 1.6</strong> - <a href="http://java.sun.com/javase/downloads/index.jsp">http://java.sun.com/javase/downloads/index.jsp</a> Set JAVA_HOME to the root of your Java installation.</li> - <li> <strong>Ant 1.7</strong> - (optional, for builds) <a href="http://ant.apache.org/">http://ant.apache.org/</a></li> - <li> <strong>JUnit 4.5</strong> - (optional, for unit tests) <a href="http://junit.sourceforge.net/">http://junit.sourceforge.net/</a></li> + <li> <strong>Hadoop 0.20.2</strong> - <a href="http://hadoop.apache.org/common/releases.html">http://hadoop.apache.org/common/releases.html</a></li> + <li> <strong>Java 1.6</strong> - <a
href="http://java.sun.com/javase/downloads/index.jsp">http://java.sun.com/javase/downloads/index.jsp</a> (set JAVA_HOME to the root of your Java installation)</li> + <li> <strong>Ant 1.7</strong> - <a href="http://ant.apache.org/">http://ant.apache.org/</a> (optional, for builds) </li> + <li> <strong>JUnit 4.5</strong> - <a href="http://junit.sourceforge.net/">http://junit.sourceforge.net/</a> (optional, for unit tests) </li> </ol> <p><strong>Windows</strong> users need to install Cygwin and the Perl package: <a href="http://www.cygwin.com/"> http://www.cygwin.com/</a></p> - </section> - <section> - <title>Run Modes</title> - <p>Pig has two run modes or exectypes: </p> - <ul> - <li><p> Local Mode - To run Pig in local mode, you need access to a single machine. </p></li> - <li><p> Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. - Pig will automatically allocate and deallocate a 15-node cluster.</p></li> - </ul> - <p>You can run the Grunt shell, Pig scripts, or embedded programs using either mode.</p> </section> </section> @@ -68,6 +58,18 @@ $ pig </source> </section> + <section> + <title>Run Modes</title> + <p>Pig has two run modes or exectypes: </p> + <ul> + <li><p> Local Mode - To run Pig in local mode, you need access to a single machine. </p></li> + <li><p> Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. + Pig will automatically allocate and deallocate a 15-node cluster.</p></li> + </ul> + <p>You can run the Grunt shell, Pig scripts, or embedded programs using either mode.</p> + </section> + + <section> <title>Grunt Shell</title> <p>Use Pig's interactive shell, Grunt, to enter pig commands manually. 
See the <a href="setup.html#Sample+Code">Sample Code</a> for instructions about the passwd file used in the example.</p> @@ -126,11 +128,16 @@ $ pig -x mapreduce id.pig <section> <title>Environment Variables and Properties</title> - <p>Refer to the <a href="setup.html#Download+Pig">Download Pig</a> section.</p> + <p>See <a href="setup.html#Download+Pig">Download Pig</a>.</p> <p>The Pig environment variables are described in the Pig script file, located in the /pig-n.n.n/bin directory.</p> <p>The Pig properties file, pig.properties, is located in the /pig-n.n.n/conf directory. You can specify an alternate location using the PIG_CONF_DIR environment variable.</p> </section> + <section> + <title>Run Modes</title> + <p>See <a href="setup.html#Run+Modes">Run Modes</a>. </p> + </section> + <section> <title>Embedded Programs</title> <p>Use the embedded option to embed Pig commands in a host language and run the program. Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml Thu Apr 22 19:35:03 2010 @@ -39,17 +39,23 @@ See http://forrest.apache.org/docs/linki <site label="Pig" href="" xmlns="http://apache.org/forrest/linkmap/1.0" tab=""> - <docs label="Getting Started"> + <docs label="Pig"> <index label="Overview" href="index.html" /> <quickstart label="Setup" href="setup.html" /> <tutorial label="Tutorial" href="tutorial.html" /> - </docs> - <docs label="Guides"> <plref1 label="Pig Latin 1" href="piglatin_ref1.html" /> <plref2 label="Pig Latin 2" href="piglatin_ref2.html" /> <cookbook label="Cookbook" href="cookbook.html" />
<udf label="UDFs" href="udf.html" /> </docs> + + <docs label="Pig Miscellaneous"> + <api label="API Docs" href="ext:api"/> + <wiki label="Wiki" href="ext:wiki" /> + <faq label="FAQ" href="ext:faq" /> + <relnotes label="Release Notes" href="ext:relnotes" /> + </docs> + <docs label="Zebra"> <zover label="Zebra Overview " href="zebra_overview.html" /> <zusers label="Zebra Users " href="zebra_users.html" /> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml Thu Apr 22 19:35:03 2010 @@ -32,6 +32,6 @@ --> <tab label="Project" href="http://hadoop.apache.org/pig/" type="visible" /> <tab label="Wiki" href="http://wiki.apache.org/pig/" type="visible" /> - <tab label="Pig 0.6.0 Documentation" dir="" type="visible" /> + <tab label="Pig 0.7.0 Documentation" dir="" type="visible" /> </tabs> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml Thu Apr 22 19:35:03 2010 @@ -36,15 +36,15 @@ </li> </ul> <p>The Pig tutorial file (tutorial/pigtutorial.tar.gz file in the pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log
files). -These files work with Hadoop 0.20 and provide everything you need to run the Pig scripts.</p> +These files work with Hadoop 0.20.2 and include everything you need to run the Pig scripts.</p> <p>To get started, follow these basic steps: </p> <ol> -<li><p>Install Java. </p> +<li><p>Install Java </p> </li> -<li><p>Download the Pig tutorial file and install Pig. </p> +<li><p>Install Pig </p> </li> -<li><p>Run the Pig scripts - locally or on a Hadoop cluster. </p> +<li><p>Run the Pig scripts - in Local or Hadoop mode </p> </li> </ol> </section> @@ -53,12 +53,12 @@ These files work with Hadoop 0.20 and pr <title> Java Installation</title> <p>Make sure your run-time environment includes the following: </p> -<ol > +<ul > <li><p>Java 1.6 or higher (preferably from Sun) </p> </li> <li><p>The JAVA_HOME environment variable is set the root of your Java installation. </p> </li> -</ol> +</ul> </section> @@ -70,21 +70,17 @@ These files work with Hadoop 0.20 and pr <li><p>Download the Pig tutorial file to your local directory. </p> </li> <li><p>Unzip the Pig tutorial file (the files are stored in a newly created directory, pigtmp). </p> -</li> -</ol> - <source> $ tar -xzf pigtutorial.tar.gz </source> -<p> </p> -<ol> +</li> <li><p>Move to the pigtmp directory. </p> </li> <li><p>Review the contents of the Pig tutorial file. </p> </li> <li><p>Copy the <strong>pig.jar</strong> file to the appropriate directory on your system. For example: /home/me/pig. </p> </li> -<li><p>Create an environment variable, <strong>PIGDIR</strong>, and point it to your directory. For example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh). </p> +<li><p>Create an environment variable, <strong>PIGDIR</strong>, and point it to your directory; for example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh). 
</p> </li> </ol> @@ -95,26 +91,30 @@ $ tar -xzf pigtutorial.tar.gz <p>To run the Pig scripts in local mode, do the following: </p> <ol> -<li><p>Move to the pigtmp directory. </p> -</li> -<li><p>Review Pig Script 1 and Pig Script 2. </p> -</li> -<li><p>Execute the following command (using either script1-local.pig or script2-local.pig). </p> +<li> +<p>Set the maximum memory for Java.</p> +<source> +java -Xmx256m -cp pig.jar org.apache.pig.Main -x local script1-local.pig +java -Xmx256m -cp pig.jar org.apache.pig.Main -x local script2-local.pig +</source> </li> -</ol> - +<li><p>Move to the pigtmp directory. </p></li> +<li><p>Review Pig Script 1 and Pig Script 2. </p></li> +<li> +<p>Execute the following command (using either script1-local.pig or script2-local.pig). </p> <source> $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig </source> -<ol> -<li><p>Review the result file (either script1-local-results.txt or script2-local-results.txt): </p> </li> -</ol> - +<li><p>Review the result files, located in the part-r-00000 directory.</p> +<p>The output may contain a few Hadoop warnings which can be ignored:</p> <source> -$ ls -l script1-local-results.txt -$ cat script1-local-results.txt +2010-04-08 12:55:33,642 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics +- Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized </source> +</li> +</ol> </section> @@ -128,32 +128,26 @@ $ cat script1-local-results.txt <li><p>Review Pig Script 1 and Pig Script 2. </p> </li> <li><p>Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory. </p> -</li> -</ol> - <source> $ hadoop fs -copyFromLocal excite.log.bz2 . </source> -<ol> +</li> + <li><p>Set the HADOOP_CONF_DIR environment variable to the location of your core-site.xml, hdfs-site.xml and mapred-site.xml files.
</p> </li> <li><p>Execute the following command (using either script1-hadoop.pig or script2-hadoop.pig): </p> -</li> -</ol> - <source> $ java -cp $PIGDIR/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main script1-hadoop.pig </source> -<ol> -<li><p>Review the result files (located in either the script1-hadoop-results or script2-hadoop-results HDFS directory): </p> </li> -</ol> +<li><p>Review the result files, located in the script1-hadoop-results or script2-hadoop-results HDFS directory: </p> <source> $ hadoop fs -ls script1-hadoop-results $ hadoop fs -cat 'script1-hadoop-results/*' | less </source> - +</li> +</ol> </section> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml Thu Apr 22 19:35:03 2010 @@ -749,7 +749,7 @@ This enables Pig users/developers to cre <section> <title> Load Functions</title> <p><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a> -abstract class has the main methods for loading data and for most use cases it would suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality: </p> +abstract class has the main methods for loading data and for most use cases it would suffice to extend it. 
There are three other optional interfaces which can be implemented to achieve extended functionality: </p> <ul> <li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup">LoadMetadata</a> @@ -780,7 +780,7 @@ has methods to convert byte arrays to sp <p><strong>Example Implementation</strong></p> <p> -The loader implementation in the example is a loader for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to current PigStorage loader in Pig. The implementation uses an existing Hadoop supported !Inputformat - TextInputFormat as the underlying InputFormat. +The loader implementation in the example is a loader for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to current PigStorage loader in Pig. The implementation uses an existing Hadoop supported Inputformat - TextInputFormat - as the underlying InputFormat. </p> <source> public class SimpleTextLoader extends LoadFunc { @@ -1239,132 +1239,19 @@ public class IntMax extends EvalFunc< </section> - <section> -<title>Custom Slicer</title> -<p>Sometimes a <code>LoadFunc</code> needs more control over how input is chopped up or even found. </p> -<p>Here are some scenarios that call for a custom slicer: </p> -<ul> -<li><p> Input needs to be chopped up differently than on block boundaries. (Perhaps you want every 1M instead of every 128M. Or, you may want to process in big 1G chunks.) </p> -</li> -<li><p> Input comes from a source outside of HDFS. (Perhaps you are reading from a database.) </p> -</li> -<li><p> There are locality preferences for processing the data that is more than simple HDFS locality. 
</p> -</li> -<li><p> Extra information needs to be passed from the client machine to the <code>LoadFunc</code> instances running remotely. </p> -</li> -</ul> -<p>All of these scenarios are addressed by slicers. There are two parts to the slicing framework: <code>Slicer</code>, the class that creates slices, and <code>Slice</code>, the class that represents a particular piece of the input. Slicing kicks in when Pig sees that the <code>LoadFunc</code> implements the <code>Slicer</code> interface. </p> - - -<section> -<title>Slicer</title> - -<p>The slicer has two basic functions: validate input and slice up the input. Both of these methods will be called on the client machine. </p> - -<source> -public interface Slicer { - void validate(DataStorage store, String location) throws IOException; - Slice[] slice(DataStorage store, String location) throws IOException; -} -</source> -</section> - -<section> -<title>Slice</title> - -<p>Each slice describes a unit of work and will correspond to a map task in Hadoop. </p> - -<source> -public interface Slice extends Serializable { - String[] getLocations(); - void init(DataStorage store) throws IOException; - long getStart(); - long getLength(); - void close() throws IOException; - long getPos() throws IOException; - float getProgress() throws IOException; - boolean next(Tuple value) throws IOException; -} -</source> - -<p>Only one of the methods is used for scheduling: <code>getLocations()</code>. This method allows the implementor to give hints to Pig about where the task should be run. It is only a hint. If things are busy, the task may get scheduled elsewhere. </p> -<p>The rest of the <code>Slice</code> methods are used to read records on the processing nodes. <code>init</code> is called right after the <code>Slice</code> object is deserialized and <code>close</code> is called after the last record has been read. 
The Pig runtime will read records from the <code>Slice</code> until <code>getPos()</code> exceeds <code>getLength()</code>. Because <code>Slice</code> implements serializable, <code>Slicer</code> can encode information in the <code>Slice</code> that will later be available when the task is run. </p> - -</section> <section> -<title> Example</title> - -<p>This example shows a simple <code>Slicer</code> that gets a count from the input stream and generates that number of <code>Slice</code> s. </p> - -<source> -public class RangeSlicer implements Slicer, LoadFunc { - /** - * Expects location to be a Stringified integer, and makes - * Integer.parseInt(location) slices. Each slice generates a single value, - * its index in the sequence of slices. - */ - public Slice[] slice (DataStorage store, String location) throws IOException { - // Note: validate has already made sure that location is an integer - int numslices = Integer.parseInt(location); - Slice[] slices = new Slice[numslices]; - for (int i = 0; i < slices.length; i++) { - slices[i] = new SingleValueSlice(i); - } - return slices; - } - public void validate(DataStorage store, String location) throws IOException { - try { - Integer.parseInt(location); - } catch (NumberFormatException nfe) { - throw new IOException(nfe.getMessage()); - } - } - /** - * A Slice that returns a single value from next. - */ - public static class SingleValueSlice implements Slice { - // note this value is set by the Slicer and will get serialized and deserialized at the remote processing node - public int val; - // since we just have a single value, we can use a boolean rather than a counter - private transient boolean read; - public SingleValueSlice (int value) { - this.val = value; - } - public void close () throws IOException {} - public long getLength () { return 1; } - public String[] getLocations () { return new String[0]; } - public long getStart() { return 0; } - public long getPos () throws IOException { return read ?
1 : 0; } - public float getProgress () throws IOException { return read ? 1 : 0; } - public void init (DataStorage store) throws IOException {} - public boolean next (Tuple value) throws IOException { - if (!read) { - value.appendField(new DataAtom(val)); - read = true; - return true; - } - return false; - } - private static final long serialVersionUID = 1L; - } -} -</source> +<title>Passing Configurations to UDFs</title> +<p>The singleton UDFContext class provides two features to UDF writers. First, on the backend, it allows UDFs to get access to the JobConf object, by calling getJobConf. This is only available on the backend (at run time) as the JobConf has not yet been constructed on the front end (during planning time).</p> -<p>You can invoke the <code>RangeSlicer</code> class with the following Pig Latin statement: </p> - -<source> -LOAD '27' USING RangeSlicer(); -</source> +<p>Second, it allows UDFs to pass configuration information between instantiations of the UDF on the front and backends. UDFs can store information in a configuration object when they are constructed on the front end, or during other front end calls such as describeSchema. They can then read that information on the backend when exec (for EvalFunc) or getNext (for LoadFunc) is called. Note that information will not be passed between instantiations of the function on the backend. The communication channel only works from front end to back end.</p> +<p>To store information, the UDF calls getUDFProperties. This returns a Properties object which the UDF can record the information in or read the information from. To avoid name space conflicts UDFs are required to provide a signature when obtaining a Properties object. This can be done in two ways. The UDF can provide its Class object (via this.getClass()). In this case, every instantiation of the UDF will be given the same Properties object. The UDF can also provide its Class plus an array of Strings. 
The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF.</p> </section> </section> - -</section> - </body> </document> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml Thu Apr 22 19:35:03 2010 @@ -36,7 +36,7 @@ <!-- HADOOP M/R API--> <section> <title>Hadoop MapReduce APIs</title> - <p>Zebra requires Hadoop 20. This release of Zebra supports the "new" jobContext-style MapReduce APIs. </p> + <p>This release of Zebra supports the "new" jobContext-style MapReduce APIs. </p> <ul> <li>org.apache.hadoop.mapreduce.* - supported ("new" jobContext-style mapreduce API)</li> <li>org.apache.hadoop.mapred.* - supported, but deprecated ("old" jobConf-style mapreduce API)</li> @@ -49,18 +49,7 @@ <!-- ZEBRA API--> <section> <title>Zebra MapReduce APIs</title> - <p>Zebra includes several classes for use in MapReduce programs, located here (.....).</p> - <p>Please note these APIs. The main entry point into Zebra are the two classes for reading and writing tables, namely TableInputFormat and BasicTableOutputFormat. </p> - <ul> - <li>BasicTableOutputFormat</li> - <li>TableInputformat</li> - <li>TableRecordReader</li> - <li>ZebraOutputPartition</li> - <li>ZebraProjection</li> - <li>ZebraSchema</li> - <li>ZebraStorageHint</li> - <li>ZebraSortInfo</li> - </ul> + <p>Zebra includes several classes for use in MapReduce programs. 
The main entry point into Zebra are the two classes for reading and writing tables, namely TableInputFormat and BasicTableOutputFormat. </p> </section> <!-- END ZEBRA API--> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml Thu Apr 22 19:35:03 2010 @@ -42,16 +42,17 @@ <title>Prerequisites</title> <p>Zebra requires:</p> <ul> - <li>Pig 0.7.0 or later</li> - <li>Hadoop 0.20.1 or later</li> + <li>Pig 0.7.0 or later </li> + <li>Hadoop 0.20.2 or later</li> </ul> <p></p> <p>Also, make sure the following software is installed on your system:</p> <ul> <li>JDK 1.6</li> <li>Ant 1.7.1</li> - <li>Javacc 4.2</li> </ul> + <p></p> + <p><strong>Note:</strong> Zebra requires Pig.jar in its classpath to compile and run.</p> </section> <section>
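The front-end-to-back-end channel that the udf.xml change above documents can be sketched as follows. This is an illustrative EvalFunc, not code from the patch: the class name, property key, and schema handling are hypothetical, and it assumes pig.jar (and Hadoop) on the classpath. The UDFContext calls shown (getUDFContext, getUDFProperties) are the ones the new documentation names; the property is written during a front-end call (outputSchema) and read back in exec on the backend.

```java
import java.io.IOException;
import java.util.Properties;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.util.UDFContext;

// Hypothetical UDF: records its input schema on the front end (where
// outputSchema is called) and reads it back on the backend (where it is not).
public class SchemaEcho extends EvalFunc<String> {

    @Override
    public Schema outputSchema(Schema input) {
        // Front end (planning time): stash the input schema in the
        // Properties object keyed by this UDF's Class.
        Properties props = UDFContext.getUDFContext()
                                     .getUDFProperties(this.getClass());
        props.setProperty("schemaecho.input.schema", input.toString());
        return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
    }

    @Override
    public String exec(Tuple input) throws IOException {
        // Back end (run time): the Properties recorded on the front end
        // have been shipped with the job; getJobConf() would also be
        // usable here, per the documentation above.
        Properties props = UDFContext.getUDFContext()
                                     .getUDFProperties(this.getClass());
        String frontEndSchema = props.getProperty("schemaecho.input.schema");
        return frontEndSchema + " -> " + input.toString();
    }
}
```

In a script this would be invoked like any other EvalFunc, e.g. `B = FOREACH A GENERATE SchemaEcho(*);`. Using `getUDFProperties(this.getClass(), new String[] {...})` instead would give each distinctly-constructed instantiation its own Properties object, as the documentation describes.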