Author: olga Date: Thu Apr 22 19:35:03 2010 New Revision: 937032 URL: http://svn.apache.org/viewvc?rev=937032&view=rev Log: PIG-1320: final documentation updates for Pig 0.7.0 (chandec via olgan)
Modified: hadoop/pig/branches/branch-0.7/CHANGES.txt hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml Modified: hadoop/pig/branches/branch-0.7/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/CHANGES.txt?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/CHANGES.txt (original) +++ hadoop/pig/branches/branch-0.7/CHANGES.txt Thu Apr 22 19:35:03 2010 @@ -68,6 +68,8 @@ manner (rding via pradeepkth) IMPROVEMENTS +PIG-1320: final documentation updates for Pig 0.7.0 (chandec via olgan) + PIG-1330: Move pruned schema tracking logic from LoadFunc to core code (daijy) PIG-1320: more documentation updates for Pig 0.7.0 (chandec via olgan) Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml (original) +++ 
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml Thu Apr 22 19:35:03 2010 @@ -25,7 +25,6 @@ </header> <body> - <!-- ABOUT PIG LATIN --> <section> <title>Overview</title> @@ -55,25 +54,46 @@ <section> <title>Running Pig Latin </title> - <p>You can execute Pig Latin statements interactively or in batch mode using Pig scripts (see the <a href="piglatin_ref2.html#exec">exec</a> and <a href="piglatin_ref2.html#run">run</a> commands).</p> - <p>Grunt Shell, Interactive or Batch Mode</p> + <p>You can execute Pig Latin statements: </p> + <ul> + <li>Using grunt shell or command line</li> + <li>In mapreduce mode or local mode</li> + <li>Either interactively or in batch </li> + </ul> + + + <p></p> +<p>Note that Pig now uses Hadoop's local mode (rather than Pig's native local mode).</p> +<p>A few run examples are shown here; see <a href="setup.html">Pig Setup</a> for more examples.</p> + + + <p>Grunt Shell - interactive, mapreduce mode (because mapreduce mode is the default, you do not need to specify it)</p> <source> $ pig ... - Connecting to ... grunt> A = load 'data'; grunt> B = ...
; -or +</source> + + <p>Grunt Shell - batch, local mode (see the <a href="piglatin_ref2.html#exec">exec</a> and <a href="piglatin_ref2.html#run">run</a> commands)</p> + <source> +$ pig -x local grunt> exec myscript.pig; or grunt> run myscript.pig; -</source> +</source> -<p>Command Line, Batch Mode</p> +<p>Command Line - batch, mapreduce mode</p> <source> $ pig myscript.pig </source> +<p>Command Line - batch, local mode</p> + <source> +$ pig -x local myscript.pig +</source> + <p></p> <p><em>In general</em>, Pig processes Pig Latin statements as follows:</p> <ol> @@ -105,7 +125,9 @@ DUMP B; </source> <p> </p> - <p>Note: See Multi-Query Execution for more information on how Pig Latin statements are processed.</p> + <p>See <a href="#Multi-Query+Execution">Multi-Query Execution</a> for more information on how Pig Latin statements are processed.</p> + + </section> <section> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml Thu Apr 22 19:35:03 2010 @@ -5457,14 +5457,16 @@ readability, programmers usually use GRO <para>The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group.
This tuple contains two fields: </para> <itemizedlist> <listitem> - <para>The first field is named "group" (do not confuse this with the GROUP operator) and is the same type of the group key.</para> + <para>The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.</para> </listitem> <listitem> <para>The second field takes the name of the original relation and is type bag.</para> - <para/> + </listitem> + <listitem> <para>The names of both fields are generated by the system as shown in the example below.</para> </listitem> </itemizedlist> + <para></para> <para> Note that the GROUP (and thus COGROUP) and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples. </para> @@ -6642,7 +6644,7 @@ a:8,b:4,c:3 <para>cmd_alias</para> </entry> <entry> - <para>The name of a command created using the <ulink url="#DEFINE">DEFINE</ulink> operator.</para> + <para>The name of a command created using the <ulink url="#DEFINE">DEFINE</ulink> operator (see the DEFINE operator for additional streaming examples).</para> </entry> </row> <row> @@ -6672,13 +6674,13 @@ A = LOAD 'data'; B = STREAM A THROUGH 'stream.pl -n 5'; </programlisting> - <para>When used with a cmd_alias, a stream statement could look like this, where cmd is the defined alias.</para> + <para>When used with a cmd_alias, a stream statement could look like this, where mycmd is the defined alias.</para> <programlisting> A = LOAD 'data'; -DEFINE cmd 'stream.pl -n 5'; +DEFINE mycmd 'stream.pl -n 5'; -B = STREAM A THROUGH cmd; +B = STREAM A THROUGH mycmd; </programlisting> </section> @@ -6741,10 +6743,7 @@ E = STREAM C THROUGH 'stream.pl'; X = STREAM A THROUGH 'stream.pl' as (f1:int, f2:int, f3:int); </programlisting> </section> - - <section> - <title>Additional Examples</title> - <para>See the UDF statement DEFINE for additional examples.</para></section></section> + </section> <section>
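The GROUP output shape documented in this hunk (a first field named "group" typed like the key, and a second field that is a bag named after the input relation, with one output tuple per group) can be sketched outside Pig. The following is an illustrative Python model of those semantics only, not Pig code; the helper name and sample relation are invented:

```python
# Illustrative model of GROUP semantics (not Pig code): one output tuple per
# group, first field is the group key, second field is the "bag" of original
# tuples that share that key.
from collections import OrderedDict

def group_relation(tuples, key_index):
    """Group a list of tuples by the field at key_index, preserving first-seen key order."""
    groups = OrderedDict()
    for t in tuples:
        groups.setdefault(t[key_index], []).append(t)
    return [(key, bag) for key, bag in groups.items()]

# Roughly: A = LOAD 'data' AS (name, age);  B = GROUP A BY name;
A = [("john", 18), ("mary", 19), ("john", 21)]
B = group_relation(A, 0)
print(B)  # [('john', [('john', 18), ('john', 21)]), ('mary', [('mary', 19)])]
```

As in Pig, grouping by a single field yields a key of that field's type; a multi-field group key would correspond to a tuple of those fields.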
<title>UNION</title> @@ -7120,8 +7119,9 @@ Local Rearrange[tuple]{chararray}(false) <section> <title>ILLUSTRATE</title> + <para>(Note! This feature is NOT maintained at the moment. We are looking for someone to adopt it.)</para> <para>Displays a step-by-step execution of a sequence of statements.</para> - + <section> <title>Syntax</title> <informaltable frame="all"> @@ -7248,14 +7248,14 @@ ILLUSTRATE num_user_visits; <section> <title>DEFINE</title> - <para>Assigns an alias to a function or command.</para> + <para>Assigns an alias to a UDF function or a streaming command.</para> <section> <title>Syntax</title> <informaltable frame="all"> <tgroup cols="1"><tbody><row> <entry> - <para>DEFINE alias {function | ['command' [input] [output] [ship] [cache]] };</para> + <para>DEFINE alias {function | [`command` [input] [output] [ship] [cache]] };</para> </entry> </row></tbody></tgroup> </informaltable></section> @@ -7268,7 +7268,7 @@ ILLUSTRATE num_user_visits; <para>alias</para> </entry> <entry> - <para>The name for the function or command.</para> + <para>The name for a UDF function or the name for a streaming command (the cmd_alias for the <ulink url="#STREAM">STREAM</ulink> operator). </para> </entry> </row> <row> @@ -7276,14 +7276,16 @@ ILLUSTRATE num_user_visits; <para>function</para> </entry> <entry> - <para>The name of a function.</para> + <para>For use with functions.</para> + <para>The name of a UDF function. 
</para> </entry> </row> <row> <entry> - <para>`command `</para> + <para>`command`</para> </entry> <entry> + <para>For use with streaming.</para> <para>A command, including the arguments, enclosed in back tics (where a command is anything that can be executed).</para> </entry> </row> @@ -7292,6 +7294,7 @@ ILLUSTRATE num_user_visits; <para>input</para> </entry> <entry> + <para>For use with streaming.</para> <para>INPUT ( {stdin | 'path'} [USING serializer] [, {stdin | 'path'} [USING serializer] …] )</para> <para>Where:</para> <itemizedlist> @@ -7305,7 +7308,7 @@ ILLUSTRATE num_user_visits; <para>USING – Keyword.</para> </listitem> <listitem> - <para>serializer – PigStream is the default serializer. </para> + <para>serializer – PigStreaming is the default serializer. </para> </listitem> </itemizedlist> </entry> @@ -7315,6 +7318,7 @@ ILLUSTRATE num_user_visits; <para>output</para> </entry> <entry> + <para>For use with streaming.</para> <para>OUTPUT ( {stdout | stderr | 'path'} [USING deserializer] [, {stdout | stderr | 'path'} [USING deserializer] …] )</para> <para>Where:</para> <itemizedlist> @@ -7328,7 +7332,7 @@ ILLUSTRATE num_user_visits; <para>USING – Keyword.</para> </listitem> <listitem> - <para>deserializer – PigStream is the default deserializer. </para> + <para>deserializer – PigStreaming is the default deserializer.
</para> </listitem> </itemizedlist> </entry> @@ -7338,6 +7342,7 @@ ILLUSTRATE num_user_visits; <para>ship</para> </entry> <entry> + <para>For use with streaming.</para> <para>SHIP('path' [, 'path' …])</para> <para>Where:</para> <itemizedlist> @@ -7355,6 +7360,7 @@ ILLUSTRATE num_user_visits; <para>cache</para> </entry> <entry> + <para>For use with streaming.</para> <para>CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file' …])</para> <para>Where:</para> <itemizedlist> @@ -7371,8 +7377,8 @@ ILLUSTRATE num_user_visits; <section> <title>Usage</title> - <para>Use the DEFINE statement to assign a name (alias) to a function or to a command.</para> - <para>Use DEFINE to specify a function when:</para> + <para>Use the DEFINE statement to assign a name (alias) to a UDF function or to a streaming command.</para> + <para>Use DEFINE to specify a UDF function when:</para> <itemizedlist> <listitem> <para>The function has a long package name that you don't want to include in a script, especially if you call the function several times in that script.</para> @@ -7381,10 +7387,19 @@ ILLUSTRATE num_user_visits; <para>The constructor for the function takes string parameters.
If you need to use different constructor parameters for different calls to the function you will need to create multiple defines – one for each parameter set.</para> </listitem> </itemizedlist> - <para>Use DEFINE to specify a command when the <ulink url="#STREAM">streaming</ulink> command specification is complex or requires additional parameters (input, output, and so on).</para> + <para>Use DEFINE to specify a streaming command when: </para> + <itemizedlist> + <listitem> + <para>The streaming command specification is complex.</para> + </listitem> + <listitem> + <para>The streaming command specification requires additional parameters (input, output, and so on).</para> + </listitem> + </itemizedlist> - <section - ><title>About Input and Output</title> + + <section> + <title>About Input and Output</title> <para>Serialization is needed to convert data from tuples to a format that can be processed by the streaming application. Deserialization is needed to convert the output from the streaming application back into tuples. PigStreaming is the default serialization/deserialization function.</para> <para>Streaming uses the same default format as PigStorage to serialize/deserialize the data. If you want to explicitly specify a format, you can do it as shown below (see more examples in the Examples: Input/Output section). </para> @@ -7534,6 +7549,19 @@ X = STREAM A THROUGH Y; </programlisting> </section> + </section> + <section> + <title>Example: DEFINE with STREAM</title> +<para>In this example a command is defined for use with the <ulink url="#STREAM">STREAM</ulink> operator.</para> +<programlisting> +A = LOAD 'data'; + +DEFINE mycmd 'stream_cmd -input file.dat'; + +B = STREAM A through mycmd; +</programlisting> +</section> + <section> <title>Examples: Logging</title> <para>In this example the streaming stderr is stored in the _logs/<dir> directory of the job's output directory.
Because the job can have multiple streaming applications associated with it, you need to ensure that different directory names are used to avoid conflicts. Pig stores up to 100 tasks per streaming job.</para> @@ -7554,16 +7582,8 @@ A = LOAD 'students'; B = FOREACH A GENERATE myFunc($0); </programlisting> - -<para>In this example a command is defined for use with the STREAM operator.</para> -<programlisting> -A = LOAD 'data'; -DEFINE cmd 'stream_cmd -input file.dat'; -B = STREAM A through cmd; -</programlisting> -</section> </section> <section> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/setup.xml Thu Apr 22 19:35:03 2010 @@ -28,22 +28,12 @@ <title>Requirements</title> <p><strong>Unix</strong> and <strong>Windows</strong> users need the following:</p> <ol> - <li> <strong>Hadoop 20</strong> - <a href="http://hadoop.apache.org/core/">http://hadoop.apache.org/core/</a></li> - <li> <strong>Java 1.6</strong> - <a href="http://java.sun.com/javase/downloads/index.jsp">http://java.sun.com/javase/downloads/index.jsp</a> Set JAVA_HOME to the root of your Java installation.</li> - <li> <strong>Ant 1.7</strong> - (optional, for builds) <a href="http://ant.apache.org/">http://ant.apache.org/</a></li> - <li> <strong>JUnit 4.5</strong> - (optional, for unit tests) <a href="http://junit.sourceforge.net/">http://junit.sourceforge.net/</a></li> + <li> <strong>Hadoop 0.20.2</strong> - <a href="http://hadoop.apache.org/common/releases.html">http://hadoop.apache.org/common/releases.html</a></li> + <li> <strong>Java 1.6</strong> - <a
href="http://java.sun.com/javase/downloads/index.jsp">http://java.sun.com/javase/downloads/index.jsp</a> (set JAVA_HOME to the root of your Java installation)</li> + <li> <strong>Ant 1.7</strong> - <a href="http://ant.apache.org/">http://ant.apache.org/</a> (optional, for builds) </li> + <li> <strong>JUnit 4.5</strong> - <a href="http://junit.sourceforge.net/">http://junit.sourceforge.net/</a> (optional, for unit tests) </li> </ol> <p><strong>Windows</strong> users need to install Cygwin and the Perl package: <a href="http://www.cygwin.com/"> http://www.cygwin.com/</a></p> - </section> - <section> - <title>Run Modes</title> - <p>Pig has two run modes or exectypes: </p> - <ul> - <li><p> Local Mode - To run Pig in local mode, you need access to a single machine. </p></li> - <li><p> Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. - Pig will automatically allocate and deallocate a 15-node cluster.</p></li> - </ul> - <p>You can run the Grunt shell, Pig scripts, or embedded programs using either mode.</p> </section> </section> @@ -68,6 +58,18 @@ $ pig </source> </section> + <section> + <title>Run Modes</title> + <p>Pig has two run modes or exectypes: </p> + <ul> + <li><p> Local Mode - To run Pig in local mode, you need access to a single machine. </p></li> + <li><p> Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. + Pig will automatically allocate and deallocate a 15-node cluster.</p></li> + </ul> + <p>You can run the Grunt shell, Pig scripts, or embedded programs using either mode.</p> + </section> + + <section> <title>Grunt Shell</title> <p>Use Pig's interactive shell, Grunt, to enter pig commands manually. 
See the <a href="setup.html#Sample+Code">Sample Code</a> for instructions about the passwd file used in the example.</p> @@ -126,11 +128,16 @@ $ pig -x mapreduce id.pig <section> <title>Environment Variables and Properties</title> - <p>Refer to the <a href="setup.html#Download+Pig">Download Pig</a> section.</p> + <p>See <a href="setup.html#Download+Pig">Download Pig</a>.</p> <p>The Pig environment variables are described in the Pig script file, located in the /pig-n.n.n/bin directory.</p> <p>The Pig properties file, pig.properties, is located in the /pig-n.n.n/conf directory. You can specify an alternate location using the PIG_CONF_DIR environment variable.</p> </section> + <section> + <title>Run Modes</title> + <p>See <a href="setup.html#Run+Modes">Run Modes</a>. </p> + </section> + <section> <title>Embedded Programs</title> <p>Use the embedded option to embed Pig commands in a host language and run the program. Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/site.xml Thu Apr 22 19:35:03 2010 @@ -39,17 +39,23 @@ See http://forrest.apache.org/docs/linki <site label="Pig" href="" xmlns="http://apache.org/forrest/linkmap/1.0" tab=""> - <docs label="Getting Started"> + <docs label="Pig"> <index label="Overview" href="index.html" /> <quickstart label="Setup" href="setup.html" /> <tutorial label="Tutorial" href="tutorial.html" /> - </docs> - <docs label="Guides"> <plref1 label="Pig Latin 1" href="piglatin_ref1.html" /> <plref2 label="Pig Latin 2" href="piglatin_ref2.html" /> <cookbook label="Cookbook" href="cookbook.html" />
<udf label="UDFs" href="udf.html" /> </docs> + + <docs label="Pig Miscellaneous"> + <api label="API Docs" href="ext:api"/> + <wiki label="Wiki" href="ext:wiki" /> + <faq label="FAQ" href="ext:faq" /> + <relnotes label="Release Notes" href="ext:relnotes" /> + </docs> + <docs label="Zebra"> <zover label="Zebra Overview " href="zebra_overview.html" /> <zusers label="Zebra Users " href="zebra_users.html" /> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tabs.xml Thu Apr 22 19:35:03 2010 @@ -32,6 +32,6 @@ --> <tab label="Project" href="http://hadoop.apache.org/pig/" type="visible" /> <tab label="Wiki" href="http://wiki.apache.org/pig/" type="visible" /> - <tab label="Pig 0.6.0 Documentation" dir="" type="visible" /> + <tab label="Pig 0.7.0 Documentation" dir="" type="visible" /> </tabs> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/tutorial.xml Thu Apr 22 19:35:03 2010 @@ -36,15 +36,15 @@ </li> </ul> <p>The Pig tutorial file (tutorial/pigtutorial.tar.gz file in the pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log
files). -These files work with Hadoop 0.20 and provide everything you need to run the Pig scripts.</p> +These files work with Hadoop 0.20.2 and include everything you need to run the Pig scripts.</p> <p>To get started, follow these basic steps: </p> <ol> -<li><p>Install Java. </p> +<li><p>Install Java </p> </li> -<li><p>Download the Pig tutorial file and install Pig. </p> +<li><p>Install Pig </p> </li> -<li><p>Run the Pig scripts - locally or on a Hadoop cluster. </p> +<li><p>Run the Pig scripts - in Local or Hadoop mode </p> </li> </ol> </section> @@ -53,12 +53,12 @@ These files work with Hadoop 0.20 and pr <title> Java Installation</title> <p>Make sure your run-time environment includes the following: </p> -<ol > +<ul > <li><p>Java 1.6 or higher (preferably from Sun) </p> </li> <li><p>The JAVA_HOME environment variable is set the root of your Java installation. </p> </li> -</ol> +</ul> </section> @@ -70,21 +70,17 @@ These files work with Hadoop 0.20 and pr <li><p>Download the Pig tutorial file to your local directory. </p> </li> <li><p>Unzip the Pig tutorial file (the files are stored in a newly created directory, pigtmp). </p> -</li> -</ol> - <source> $ tar -xzf pigtutorial.tar.gz </source> -<p> </p> -<ol> +</li> <li><p>Move to the pigtmp directory. </p> </li> <li><p>Review the contents of the Pig tutorial file. </p> </li> <li><p>Copy the <strong>pig.jar</strong> file to the appropriate directory on your system. For example: /home/me/pig. </p> </li> -<li><p>Create an environment variable, <strong>PIGDIR</strong>, and point it to your directory. For example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh). </p> +<li><p>Create an environment variable, <strong>PIGDIR</strong>, and point it to your directory; for example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh). 
</p> </li> </ol> @@ -95,26 +91,30 @@ $ tar -xzf pigtutorial.tar.gz <p>To run the Pig scripts in local mode, do the following: </p> <ol> -<li><p>Move to the pigtmp directory. </p> -</li> -<li><p>Review Pig Script 1 and Pig Script 2. </p> -</li> -<li><p>Execute the following command (using either script1-local.pig or script2-local.pig). </p> +<li> +<p>Set the maximum memory for Java.</p> +<source> +java -Xmx256m -cp pig.jar org.apache.pig.Main -x local script1-local.pig +java -Xmx256m -cp pig.jar org.apache.pig.Main -x local script2-local.pig +</source> </li> -</ol> - +<li><p>Move to the pigtmp directory. </p></li> +<li><p>Review Pig Script 1 and Pig Script 2. </p></li> +<li> +<p>Execute the following command (using either script1-local.pig or script2-local.pig). </p> <source> $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig </source> -<ol> -<li><p>Review the result file (either script1-local-results.txt or script2-local-results.txt): </p> </li> -</ol> - +<li><p>Review the result files, located in the part-r-00000 directory.</p> +<p>The output may contain a few Hadoop warnings which can be ignored:</p> <source> -$ ls -l script1-local-results.txt -$ cat script1-local-results.txt +2010-04-08 12:55:33,642 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics +- Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized </source> +</li> +</ol> </section> @@ -128,32 +128,26 @@ $ cat script1-local-results.txt <li><p>Review Pig Script 1 and Pig Script 2. </p> </li> <li><p>Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory. </p> -</li> -</ol> - <source> $ hadoop fs -copyFromLocal excite.log.bz2 . </source> -<ol> +</li> + <li><p>Set the HADOOP_CONF_DIR environment variable to the location of your core-site.xml, hdfs-site.xml and mapred-site.xml files.
</p> </li> <li><p>Execute the following command (using either script1-hadoop.pig or script2-hadoop.pig): </p> -</li> -</ol> - <source> $ java -cp $PIGDIR/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main script1-hadoop.pig </source> -<ol> -<li><p>Review the result files (located in either the script1-hadoop-results or script2-hadoop-results HDFS directory): </p> </li> -</ol> +<li><p>Review the result files, located in the script1-hadoop-results or script2-hadoop-results HDFS directory: </p> <source> $ hadoop fs -ls script1-hadoop-results $ hadoop fs -cat 'script1-hadoop-results/*' | less </source> - +</li> +</ol> </section> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml Thu Apr 22 19:35:03 2010 @@ -749,7 +749,7 @@ This enables Pig users/developers to cre <section> <title> Load Functions</title> <p><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a> -abstract class has the main methods for loading data and for most use cases it would suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality: </p> +abstract class has the main methods for loading data and for most use cases it would suffice to extend it. 
There are three other optional interfaces which can be implemented to achieve extended functionality: </p> <ul> <li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup">LoadMetadata</a> @@ -780,7 +780,7 @@ has methods to convert byte arrays to sp <p><strong>Example Implementation</strong></p> <p> -The loader implementation in the example is a loader for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to current PigStorage loader in Pig. The implementation uses an existing Hadoop supported !Inputformat - TextInputFormat as the underlying InputFormat. +The loader implementation in the example is a loader for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to current PigStorage loader in Pig. The implementation uses an existing Hadoop supported Inputformat - TextInputFormat - as the underlying InputFormat. </p> <source> public class SimpleTextLoader extends LoadFunc { @@ -1239,132 +1239,19 @@ public class IntMax extends EvalFunc< </section> - <section> -<title>Custom Slicer</title> -<p>Sometimes a <code>LoadFunc</code> needs more control over how input is chopped up or even found. </p> -<p>Here are some scenarios that call for a custom slicer: </p> -<ul> -<li><p> Input needs to be chopped up differently than on block boundaries. (Perhaps you want every 1M instead of every 128M. Or, you may want to process in big 1G chunks.) </p> -</li> -<li><p> Input comes from a source outside of HDFS. (Perhaps you are reading from a database.) </p> -</li> -<li><p> There are locality preferences for processing the data that is more than simple HDFS locality. 
</p> -</li> -<li><p> Extra information needs to be passed from the client machine to the <code>LoadFunc</code> instances running remotely. </p> -</li> -</ul> -<p>All of these scenarios are addressed by slicers. There are two parts to the slicing framework: <code>Slicer</code>, the class that creates slices, and <code>Slice</code>, the class that represents a particular piece of the input. Slicing kicks in when Pig sees that the <code>LoadFunc</code> implements the <code>Slicer</code> interface. </p> - - -<section> -<title>Slicer</title> - -<p>The slicer has two basic functions: validate input and slice up the input. Both of these methods will be called on the client machine. </p> - -<source> -public interface Slicer { - void validate(DataStorage store, String location) throws IOException; - Slice[] slice(DataStorage store, String location) throws IOException; -} -</source> -</section> - -<section> -<title>Slice</title> - -<p>Each slice describes a unit of work and will correspond to a map task in Hadoop. </p> - -<source> -public interface Slice extends Serializable { - String[] getLocations(); - void init(DataStorage store) throws IOException; - long getStart(); - long getLength(); - void close() throws IOException; - long getPos() throws IOException; - float getProgress() throws IOException; - boolean next(Tuple value) throws IOException; -} -</source> - -<p>Only one of the methods is used for scheduling: <code>getLocations()</code>. This method allows the implementor to give hints to Pig about where the task should be run. It is only a hint. If things are busy, the task may get scheduled elsewhere. </p> -<p>The rest of the <code>Slice</code> methods are used to read records on the processing nodes. <code>init</code> is called right after the <code>Slice</code> object is deserialized and <code>close</code> is called after the last record has been read. 
The Pig runtime will read records from the <code>Slice</code> until <code>getPos()</code> exceeds <code>getLength()</code>. Because <code>Slice</code> implements serializable, <code>Slicer</code> can encode information in the <code>Slice</code> that will later be available when the task is run. </p> - -</section> <section> -<title> Example</title> - -<p>This example shows a simple <code>Slicer</code> that gets a count from the input stream and generates that number of <code>Slice</code> s. </p> - -<source> -public class RangeSlicer implements Slicer, LoadFunc { - /** - * Expects location to be a Stringified integer, and makes - * Integer.parseInt(location) slices. Each slice generates a single value, - * its index in the sequence of slices. - */ - public Slice[] slice (DataStorage store, String location) throws IOException { - // Note: validate has already made sure that location is an integer - int numslices = Integer.parseInt(location); - Slice[] slices = new Slice[numslices]; - for (int i = 0; i < slices.length; i++) { - slices[i] = new SingleValueSlice(i); - } - return slices; - } - public void validate(DataStorage store, String location) throws IOException { - try { - Integer.parseInt(location); - } catch (NumberFormatException nfe) { - throw new IOException(nfe.getMessage()); - } - } - /** - * A Slice that returns a single value from next. - */ - public static class SingleValueSlice implements Slice { - // note this value is set by the Slicer and will get serialized and deserialized at the remote processing node - public int val; - // since we just have a single value, we can use a boolean rather than a counter - private transient boolean read; - public SingleValueSlice (int value) { - this.val = value; - } - public void close () throws IOException {} - public long getLength () { return 1; } - public String[] getLocations () { return new String[0]; } - public long getStart() { return 0; } - public long getPos () throws IOException { return read ?
1 : 0; } - public float getProgress () throws IOException { return read ? 1 : 0; } - public void init (DataStorage store) throws IOException {} - public boolean next (Tuple value) throws IOException { - if (!read) { - value.appendField(new DataAtom(val)); - read = true; - return true; - } - return false; - } - private static final long serialVersionUID = 1L; - } -} -</source> +<title>Passing Configurations to UDFs</title> +<p>The singleton UDFContext class provides two features to UDF writers. First, on the backend, it allows UDFs to get access to the JobConf object, by calling getJobConf. This is only available on the backend (at run time) as the JobConf has not yet been constructed on the front end (during planning time).</p> -<p>You can invoke the <code>RangeSlicer</code> class with the following Pig Latin statement: </p> - -<source> -LOAD '27' USING RangeSlicer(); -</source> +<p>Second, it allows UDFs to pass configuration information between instantiations of the UDF on the front and backends. UDFs can store information in a configuration object when they are constructed on the front end, or during other front end calls such as describeSchema. They can then read that information on the backend when exec (for EvalFunc) or getNext (for LoadFunc) is called. Note that information will not be passed between instantiations of the function on the backend. The communication channel only works from front end to back end.</p> +<p>To store information, the UDF calls getUDFProperties. This returns a Properties object which the UDF can record the information in or read the information from. To avoid name space conflicts UDFs are required to provide a signature when obtaining a Properties object. This can be done in two ways. The UDF can provide its Class object (via this.getClass()). In this case, every instantiation of the UDF will be given the same Properties object. The UDF can also provide its Class plus an array of Strings. 
The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF.</p> </section> </section> - -</section> - </body> </document> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml Thu Apr 22 19:35:03 2010 @@ -36,7 +36,7 @@ <!-- HADOOP M/R API--> <section> <title>Hadoop MapReduce APIs</title> - <p>Zebra requires Hadoop 20. This release of Zebra supports the "new" jobContext-style MapReduce APIs. </p> + <p>This release of Zebra supports the "new" jobContext-style MapReduce APIs. </p> <ul> <li>org.apache.hadoop.mapreduce.* - supported ("new" jobContext-style mapreduce API)</li> <li>org.apache.hadoop.mapred.* - supported, but deprecated ("old" jobConf-style mapreduce API)</li> @@ -49,18 +49,7 @@ <!-- ZEBRA API--> <section> <title>Zebra MapReduce APIs</title> - <p>Zebra includes several classes for use in MapReduce programs, located here (.....).</p> - <p>Please note these APIs. The main entry point into Zebra are the two classes for reading and writing tables, namely TableInputFormat and BasicTableOutputFormat. </p> - <ul> - <li>BasicTableOutputFormat</li> - <li>TableInputformat</li> - <li>TableRecordReader</li> - <li>ZebraOutputPartition</li> - <li>ZebraProjection</li> - <li>ZebraSchema</li> - <li>ZebraStorageHint</li> - <li>ZebraSortInfo</li> - </ul> + <p>Zebra includes several classes for use in MapReduce programs. 
The main entry point into Zebra are the two classes for reading and writing tables, namely TableInputFormat and BasicTableOutputFormat. </p> </section> <!-- END ZEBRA API--> Modified: hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml?rev=937032&r1=937031&r2=937032&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml (original) +++ hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/zebra_overview.xml Thu Apr 22 19:35:03 2010 @@ -42,16 +42,17 @@ <title>Prerequisites</title> <p>Zebra requires:</p> <ul> - <li>Pig 0.7.0 or later</li> - <li>Hadoop 0.20.1 or later</li> + <li>Pig 0.7.0 or later </li> + <li>Hadoop 0.20.2 or later</li> </ul> <p></p> <p>Also, make sure the following software is installed on your system:</p> <ul> <li>JDK 1.6</li> <li>Ant 1.7.1</li> - <li>Javacc 4.2</li> </ul> + <p></p> + <p><strong>Note:</strong> Zebra requires Pig.jar in its classpath to compile and run.</p> </section> <section>
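The front-end-to-back-end channel that the udf.xml change above documents can be sketched as follows. This is an illustrative EvalFunc, not code from the patch: the class name, property key, and schema handling are hypothetical, and it assumes pig.jar (and Hadoop) on the classpath. The UDFContext calls shown (getUDFContext, getUDFProperties) are the ones the new documentation names; the property is written during a front-end call (outputSchema) and read back in exec on the backend.

```java
import java.io.IOException;
import java.util.Properties;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.util.UDFContext;

// Hypothetical UDF: records its input schema on the front end (where
// outputSchema is called) and reads it back on the backend (where it is not).
public class SchemaEcho extends EvalFunc<String> {

    @Override
    public Schema outputSchema(Schema input) {
        // Front end (planning time): stash the input schema in the
        // Properties object keyed by this UDF's Class.
        Properties props = UDFContext.getUDFContext()
                                     .getUDFProperties(this.getClass());
        props.setProperty("schemaecho.input.schema", input.toString());
        return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
    }

    @Override
    public String exec(Tuple input) throws IOException {
        // Back end (run time): the Properties recorded on the front end
        // have been shipped with the job; getJobConf() would also be
        // usable here, per the documentation above.
        Properties props = UDFContext.getUDFContext()
                                     .getUDFProperties(this.getClass());
        String frontEndSchema = props.getProperty("schemaecho.input.schema");
        return frontEndSchema + " -> " + input.toString();
    }
}
```

In a script this would be invoked like any other EvalFunc, e.g. `B = FOREACH A GENERATE SchemaEcho(*);`. Using `getUDFProperties(this.getClass(), new String[] {...})` instead would give each distinctly-constructed instantiation its own Properties object, as the documentation describes.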