Author: olga
Date: Tue Jan 5 17:25:58 2010
New Revision: 896137

URL: http://svn.apache.org/viewvc?rev=896137&view=rev
Log:
PIG-1175: Pig 0.6 Docs - Store v. Dump (chandec via olgan)
Modified:
    hadoop/pig/branches/branch-0.6/CHANGES.txt
    hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
    hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml

Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/CHANGES.txt?rev=896137&r1=896136&r2=896137&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/CHANGES.txt (original)
+++ hadoop/pig/branches/branch-0.6/CHANGES.txt Tue Jan 5 17:25:58 2010
@@ -26,6 +26,8 @@
 IMPROVEMENTS
+PIG-1175: Pig 0.6 Docs - Store v. Dump (chandec via olgan)
+
 PIG-1162: Pig 0.6.0 - UDF doc (chandec via olgan)
 PIG-1163: Pig/Zebra 0.6.0 release (chandec via olgan)

Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=896137&r1=896136&r2=896137&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml (original)
+++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml Tue Jan 5 17:25:58 2010
@@ -4919,58 +4919,7 @@
 </section></section>
- <section>
- <title>DUMP</title>
- <para>Displays the contents of a relation.</para>
-
- <section>
- <title>Syntax</title>
- <informaltable frame="all">
- <tgroup cols="1"><tbody><row>
- <entry>
- <para>DUMP alias;</para>
- </entry>
- </row></tbody></tgroup>
- </informaltable></section>
-
- <section>
- <title>Terms</title>
- <informaltable frame="all">
- <tgroup cols="2"><tbody><row>
- <entry>
- <para>alias</para>
- </entry>
- <entry>
- <para>The name of a relation.</para>
- </entry>
- </row></tbody></tgroup>
- </informaltable></section>
-
- <section>
- <title>Usage</title>
- <para>Use the DUMP operator to run (execute) a Pig Latin statement and to display the contents of an alias. You can use DUMP as a debugging device to make sure the results you are expecting are being generated.</para></section>
-
- <section>
- <title>Example</title>
- <para>In this example a dump is performed after each statement.</para>
-<programlisting>
-A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
-
-DUMP A;
-(John,18,4.0F)
-(Mary,19,3.7F)
-(Bill,20,3.9F)
-(Joe,22,3.8F)
-(Jill,20,4.0F)
-
-B = FILTER A BY name matches 'J.+';
-
-DUMP B;
-(John,18,4.0F)
-(Joe,22,3.8F)
-(Jill,20,4.0F)
-</programlisting>
-</section></section>
+
 <section>
 <title>FILTER </title>
@@ -6521,7 +6470,7 @@
 <section>
 <title>STORE </title>
- <para>Stores data to the file system.</para>
+ <para>Stores or saves results to the file system.</para>
 <section>
 <title>Syntax</title>
@@ -6591,7 +6540,10 @@
 <section>
 <title>Usage</title>
- <para>Use the STORE operator to run (execute) Pig Latin statements and to store data on the file system. </para></section>
+ <para>Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch mode processing.</para>
+
+ <para>Note: To debug scripts during development, you can use <ulink url="piglatin_reference.html#DUMP">DUMP</ulink> to check intermediate results.</para>
+</section>
 <section>
 <title>Examples</title>
@@ -6962,6 +6914,68 @@
 </section></section>
+
+ <section>
+ <title>DUMP</title>
+ <para>Dumps or displays results to screen.</para>
+
+ <section>
+ <title>Syntax</title>
+ <informaltable frame="all">
+ <tgroup cols="1"><tbody><row>
+ <entry>
+ <para>DUMP alias;</para>
+ </entry>
+ </row></tbody></tgroup>
+ </informaltable></section>
+
+ <section>
+ <title>Terms</title>
+ <informaltable frame="all">
+ <tgroup cols="2"><tbody><row>
+ <entry>
+ <para>alias</para>
+ </entry>
+ <entry>
+ <para>The name of a relation.</para>
+ </entry>
+ </row></tbody></tgroup>
+ </informaltable></section>
+
+ <section>
+ <title>Usage</title>
+ <para>Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen. DUMP is meant for interactive mode; statements are executed immediately and the results are not saved (persisted). You can use DUMP as a debugging device to make sure that the results you are expecting are actually generated. </para>
+
+ <para>
+ Note that production scripts <emphasis>should not</emphasis> use DUMP as it will disable multi-query optimizations and is likely to slow down execution
+ (see <ulink url="piglatin_users.html#Store+vs.+Dump">Store vs. Dump</ulink>).
+ </para>
+ </section>
+
+ <section>
+ <title>Example</title>
+ <para>In this example a dump is performed after each statement.</para>
+<programlisting>
+A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
+
+DUMP A;
+(John,18,4.0F)
+(Mary,19,3.7F)
+(Bill,20,3.9F)
+(Joe,22,3.8F)
+(Jill,20,4.0F)
+
+B = FILTER A BY name matches 'J.+';
+
+DUMP B;
+(John,18,4.0F)
+(Joe,22,3.8F)
+(Jill,20,4.0F)
+</programlisting>
+</section></section>
+
+
+
 <section>
 <title>EXPLAIN</title>
 <para>Displays execution plans.</para>

Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=896137&r1=896136&r2=896137&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml (original)
+++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml Tue Jan 5 17:25:58 2010
@@ -54,7 +54,7 @@
 <section>
 <title>Running Pig Latin </title>
- <p>You can execute Pig Latin statements interactively or in batch mode using Pig scripts (see the EXEC and RUN operators).</p>
+ <p>You can execute Pig Latin statements interactively or in batch mode using Pig scripts (see the <a href="piglatin_reference.html#exec">exec</a> and <a href="piglatin_reference.html#run">run</a> commands).</p>
 <p>Grunt Shell, Interactive or Batch Mode</p>
 <source>
@@ -228,15 +228,12 @@
 <!-- MULTI-QUERY EXECUTION-->
 <section>
 <title>Multi-Query Execution</title>
-<p>With multi-query execution Pig processes an entire script or a batch of statements at once
-(as opposed to processing statements when a DUMP or STORE is encountered). </p>
-
-
+<p>With multi-query execution Pig processes an entire script or a batch of statements at once.</p>
 <section>
 <title>Turning Multi-Query Execution On or Off</title>
 <p>Multi-query execution is turned on by default.
- To turn it off and revert to Pi'gs "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
+ To turn it off and revert to Pig's "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
 <p>To run script "myscript.pig" without the optimization, execute Pig as follows: </p>
 <source>
 $ pig -M myscript.pig
@@ -253,7 +250,8 @@
 <li>
 <p>For batch mode execution, the entire script is first parsed to determine if intermediate tasks can be combined to reduce the overall amount of work that needs to be done; execution starts only after the parsing is completed
-(see the EXPLAIN operator and the EXEC and RUN commands). </p>
+(see the <a href="piglatin_reference.html#EXPLAIN">EXPLAIN</a> operator and the <a href="piglatin_reference.html#exec">exec</a> and <a href="piglatin_reference.html#run">run</a> commands). </p>
+
 </li>
 <li>
 <p>Two run scenarios are optimized, as explained below: explicit and implicit splits, and storing intermediate results.</p>
@@ -316,7 +314,32 @@
 </section>
 </section>
+<section>
+ <title>Store vs. Dump</title>
+ <p>With multi-query exection, you want to use <a href="piglatin_reference.html#STORE">STORE</a> to save (persist) your results.
+ You do not want to use <a href="piglatin_reference.html#DUMP">DUMP</a> as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.) </p>
+
+ <p>DUMP Example: In this script, because the DUMP command is interactive, the multi-query execution will be disabled and two separate jobs will be created to execute this script. The first job will execute A > B > DUMP while the second job will execute A > B > C > STORE.</p>
+
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x > 5;
+DUMP B;
+C = FOREACH B GENERATE y, z;
+STORE C INTO 'output';
+</source>
+
+ <p>STORE Example: In this script, multi-query optimization will kick in allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
+
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x > 5;
+STORE B INTO 'output1';
+C = FOREACH B GENERATE y, z;
+STORE C INTO 'output2';
+</source>
+</section>
 <section>
 <title>Error Handling</title>
 <p>With multi-query execution Pig processes an entire script or a batch of statements at once.
@@ -352,10 +375,10 @@
 <title>Backward Compatibility</title>
 <p>Most existing Pig scripts will produce the same result with or without the multi-query execution.
- There are cases though were this is not true. Path names and schemes are discussed here.</p>
+ There are cases though where this is not true. Path names and schemes are discussed here.</p>
 <p>Any script is parsed in it's entirety before it is sent to execution. Since the current directory can change
- throughout the script any path used in load or store is translated to a fully qualified and absolute path.</p>
+ throughout the script any path used in LOAD or STORE statement is translated to a fully qualified and absolute path.</p>
 <p>In map-reduce mode, the following script will load from "hdfs://<host>:<port>/data1" and store into "hdfs://<host>:<port>/tmp/out1". </p>
 <source>
@@ -375,7 +398,7 @@
 <li><p>Specify a custom scheme for the LoadFunc/Slicer </p></li>
 </ol>
- <p>Arguments used in a load statement that have a scheme other than "hdfs" or "file" will not be expanded and passed to the LoadFunc/Slicer unchanged.</p>
+ <p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded and passed to the LoadFunc/Slicer unchanged.</p>
 <p>In the SQL case, the SQLLoader function is invoked with "sql://mytable". </p>
 <source>
@@ -416,7 +439,7 @@
 <section>
 <title>Example</title>
-<p>In this script, the store/load operators have different file paths; however, the load operator depends on the store operator.</p>
+<p>In this script, the STORE/LOAD operators have different file paths; however, the LOAD operator depends on the STORE operator.</p>
 <source>
 A = LOAD '/user/xxx/firstinput' USING PigStorage();
 B = group ....