Author: olga Date: Wed Jan 20 19:03:57 2010 New Revision: 901333 URL: http://svn.apache.org/viewvc?rev=901333&view=rev Log: PIG-1192: Pig 0.6 Docs fixes (chandec via olgan)
Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/index.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/setup.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/site.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_pig.xml hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_users.xml Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/CHANGES.txt?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/CHANGES.txt (original) +++ hadoop/pig/branches/branch-0.6/CHANGES.txt Wed Jan 20 19:03:57 2010 @@ -26,6 +26,8 @@ IMPROVEMENTS +PIG-1192: Pig 0.6 Docs fixes (chandec via olgan) + PIG-1177: Pig 0.6 Docs - Zebra docs (chandec via olgan) PIG-1175: Pig 0.6 Docs - Store v. Dump (chandec via olgan) Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/cookbook.xml Wed Jan 20 19:03:57 2010 @@ -36,7 +36,7 @@ <section> <title>Use Optimization</title> -<p>Pig supports various <a href="piglatin_users.html#Optimization+Rules">optimization rules</a> which are turned on by default. +<p>Pig supports various <a href="piglatin_ref1.html#Optimization+Rules">optimization rules</a> which are turned on by default. 
Become familiar with these rules.</p> </section> @@ -220,29 +220,34 @@ <section> <title>Specialized Join Optimizations</title> <p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins. -For more information see <a href="piglatin_users.html#Specialized+Joins">Specialized Joins</a>.</p> +For more information see <a href="piglatin_ref1.html#Specialized+Joins">Specialized Joins</a>.</p> </section> </section> <section> -<title>Use the PARALLEL Keyword</title> +<title>Use the PARALLEL Clause</title> -<p>PARALLEL controls the number of reducers invoked by Hadoop. The default value is 1. However, the number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently.</p> +<p>Use the PARALLEL clause to increase the parallelism of a job:</p> +<ul> +<li>PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The default value is 1 (one reduce task).</li> +<li>PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. </li> +<li>If you don't specify PARALLEL, you still get the same map parallelism but only one reduce task.</li> +</ul> +<p></p> +<p>As noted, the default value for PARALLEL is 1 (one reduce task). However, the number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. 
In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently.</p> -<p>The keyword makes sense with any operator that starts a reduce phase. This includes -<a href="piglatin_reference.html#COGROUP">COGROUP</a>, -<a href="piglatin_reference.html#CROSS">CROSS</a>, -<a href="piglatin_reference.html#DISTINCT">DISTINCT</a>, -<a href="piglatin_reference.html#GROUP">GROUP</a>, -<a href="piglatin_reference.html#JOIN">JOIN</a>, -<a href="piglatin_reference.html#ORDER">ORDER</a>, and -<a href="piglatin_reference.html#JOIN%2C+OUTER">OUTER JOIN</a>. - -</p> - -<p>You can set the value of PARALLEL in your scripts in conjunction with the operator (see the example below). You can also set the value of PARALLEL for all scripts using the <a href="piglatin_reference.html#set">set</a> command.</p> +<p>You can include the PARALLEL clause with any operator that starts a reduce phase (see the example below). This includes +<a href="piglatin_ref2.html#COGROUP">COGROUP</a>, +<a href="piglatin_ref2.html#CROSS">CROSS</a>, +<a href="piglatin_ref2.html#DISTINCT">DISTINCT</a>, +<a href="piglatin_ref2.html#GROUP">GROUP</a>, +<a href="piglatin_ref2.html#JOIN+%28inner%29">JOIN (inner)</a>, +<a href="piglatin_ref2.html#JOIN+%28outer%29">JOIN (outer)</a>, and +<a href="piglatin_ref2.html#ORDER">ORDER</a>. + +You can also set the value of PARALLEL for all scripts using the <a href="piglatin_ref2.html#set">set</a> command.</p> <p>Example</p> @@ -251,7 +256,6 @@ B = group A by t PARALLEL 18; ..... 
</source> - </section> <section> Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/index.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/index.xml?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/index.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/index.xml Wed Jan 20 19:03:57 2010 @@ -31,8 +31,8 @@ Then try out the <a href="tutorial.html">Pig Tutorial</a> to get an idea of how easy it is to use Pig. </p> <p> - When you are ready to start writing your own scripts, read through the <a href="piglatin_users.html">Pig Latin Users Guide</a> - and the <a href="piglatin_reference.html">Pig Latin Reference Manual</a> to become familiar with Pig's features. + When you are ready to start writing your own scripts, read through the Pig Latin Reference <a href="piglatin_ref1.html">Manual 1</a> + and <a href="piglatin_ref2.html">Manual 2</a> to become familiar with Pig's features. Also review the <a href="cookbook.html">Pig Cookbook</a> to learn how to tweak your code for optimal performance. </p> <p> Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/setup.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/setup.xml?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/setup.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/setup.xml Wed Jan 20 19:03:57 2010 @@ -71,7 +71,8 @@ <section> <title>Grunt Shell</title> <p>Use Pig's interactive shell, Grunt, to enter pig commands manually. 
See the <a href="setup.html#Sample+Code">Sample Code</a> for instructions about the passwd file used in the example.</p> -<p>You can also run or execute script files from the Grunt shell. See the RUN and EXEC commands in the <a href="piglatin_reference.html">Pig Latin Reference Manual</a>. </p> +<p>You can also run or execute script files from the Grunt shell. +See the <a href="piglatin_ref2.html#run">run</a> and <a href="piglatin_ref2.html#exec">exec</a> commands. </p> <p><strong>Local Mode</strong></p> <source> $ pig -x local Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/site.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/site.xml?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/site.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/site.xml Wed Jan 20 19:03:57 2010 @@ -45,8 +45,8 @@ <tutorial label="Tutorial" href="tutorial.html" /> </docs> <docs label="Guides"> - <plusers label="Pig Latin Users " href="piglatin_users.html" /> - <plref label="Pig Latin Reference" href="piglatin_reference.html" /> + <plref1 label="Pig Latin 1" href="piglatin_ref1.html" /> + <plref2 label="Pig Latin 2" href="piglatin_ref2.html" /> <cookbook label="Cookbook" href="cookbook.html" /> <udf label="UDFs" href="udf.html" /> </docs> Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_pig.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_pig.xml?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_pig.xml (original) +++ 
hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_pig.xml Wed Jan 20 19:03:57 2010 @@ -29,7 +29,7 @@ <section> <title>Overview</title> <p>With Pig you can load and store data in Zebra format. You can also take advantage of sorted Zebra tables for map-side groups and merge joins. When working with Pig keep in mind that, unlike MapReduce, you do not need to declare Zebra schemas. Zebra automatically converts Zebra schemas to Pig schemas (and vice versa) for you.</p> - + </section> <!-- END OVERVIEW--> @@ -54,19 +54,19 @@ <ol> <li>You need to register a Zebra jar file the same way you would do it for any other UDF.</li> <li>You need to place the jar on your classpath.</li> - <li>When using Zebra with Pig, Zebra data is self-described and always contains a schema. This means that the AS clause is unnecessary as long as - you know what the column names and types are. To determine the column names and types, you can run the DESCRIBE statement right after the load: + </ol> + + <p>Zebra data is self-described meaning that the name and type information is stored with the data; you don't need to provide an AS clause or perform type casting unless you actually need to change the data. To check column names and types, you can run the DESCRIBE statement right after the load:</p> <source> A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader(); DESCRIBE A; -a: {name: chararray,age: int,gpa: float} +A: {name: chararray,age: int,gpa: float} </source> - </li> - </ol> -<p>You can provide alternative names to the columns with the AS clause. You can also provide types as long as the - original type can be converted to the new type. <em>In general</em>, Zebra supports Pig type compatibilities - (see <a href="piglatin_reference.html#Arithmetic+Operators+and+More">Arithmetic Operators and More</a>).</p> +<p>You can provide alternative names to the columns with the AS clause. 
You can also provide alternative types as long as the + original type can be converted to the new type. (One exception to this rule is maps, since you can't specify a schema for a map. Zebra always creates map values as bytearrays, which would require casting to the real type in the script. Note that this is no different from how Pig treats maps for any other storage.) For more information see <a href="piglatin_ref2.html#Schemas">Schemas</a> and +<a href="piglatin_ref2.html#Arithmetic+Operators+and+More">Arithmetic Operators and More</a>. + </p> <p>You can provide multiple, comma-separated files to the loader:</p> <source> @@ -186,7 +186,8 @@ <section> <title>HDFS File Globs</title> <p>Pig supports HDFS file globs - (for more information about globs, see <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html">FileSystem</a> and GlobStatus).</p> + (for more information + see <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)">GlobStatus</a>).</p> <p>In this example, all Zebra tables in the directory of /path/to/PIG/tables will be loaded as a union (table union). 
</p> <source> A = LOAD '/path/to/PIG/tables/*' USING org.apache.hadoop.zebra.pig.TableLoader(''); Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_users.xml URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_users.xml?rev=901333&r1=901332&r2=901333&view=diff ============================================================================== --- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_users.xml (original) +++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/zebra_users.xml Wed Jan 20 19:03:57 2010 @@ -155,7 +155,7 @@ <section> <title>MapReduce Jobs</title> <p> -TableInputFormat has static method, requireSortedTable, that allows the caller to specify the behavior of a single sorted table or an order-preserving sorted table union as described above. The method ensures all tables in a union are sorted. For more information, see <a href="zebra_reference.html#TableInputFormat">TableInputFormat</a>. +TableInputFormat has a static method, requireSortedTable, that allows the caller to specify the behavior of a single sorted table or an order-preserving sorted table union as described above. The method ensures all tables in a union are sorted. For more information, see <a href="zebra_mapreduce.html#TableInputFormat">TableInputFormat</a>. </p> <p>One simple example: A order-preserving sorted union B. A and B are sorted tables. </p>
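
For reference, the doc changes above can be pulled together into one minimal Pig Latin sketch. This is a sketch only, not text from the commit: the glob path is the placeholder used in the zebra_pig.xml example, the field name `name` is taken from the `studenttab` DESCRIBE output, and the reduce-task count follows the cookbook's ~500 MB-per-reducer guideline.

```pig
-- Load all Zebra tables under the directory as a table union; no AS clause
-- is needed because Zebra stores column names and types with the data.
A = LOAD '/path/to/PIG/tables/*' USING org.apache.hadoop.zebra.pig.TableLoader('');

-- Verify the schema Zebra reports after the load.
DESCRIBE A;

-- GROUP starts a reduce phase, so the PARALLEL clause applies here:
-- request 18 reduce tasks for this MapReduce boundary.
B = GROUP A BY name PARALLEL 18;
```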