[Pig Wiki] Update of "PigTutorial" by CorinneC
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
  1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
+ 1. Review the [#Tutorial_Pig_Script Tutorial Pig Script] and the [#Tutorial_Join_Pig_Script Tutorial-Join Pig Script].
- 1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
+ 1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig).
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
  }}}
- 1.#4 Review the results:
+ 1.#5 Review the results:
  {{{
  $ ls -l /tmp/ngrams.txt
  }}}
@@ -47, +48 @@
  1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
  1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
+ 1. Review the [#Tutorial_Pig_Script Tutorial Pig Script] and the [#Tutorial_Join_Pig_Script Tutorial-Join Pig Script].
  1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
  }}}
- 1.#4 Set the HADOOPSITEPATH environment variable to the location of your hadoop-site.xml file.
+ 1.#5 Set the HADOOPSITEPATH environment variable to the location of your hadoop-site.xml file.
  1. Execute the following command (using either tutorial.pig or tutorial-join.pig):
  {{{
  $ java -cp pig.jar:$HADOOPSITEPATH org.apache.pig.Main tutorial.pig
  }}}
- 1.#6 Review the results:
+ 1.#7 Review the results:
  {{{
  $ hadoop dfs -ls ngrams.txt
  }}}
@@ -86, +88 @@
  || !TutorialUtil || Divides the query string into a set of words.||
+ [[Anchor(Tutorial_Pig_Script)]]
  == Tutorial Pig Script ==
  The tutorial pig script (tutorial.pig or tutorial-local.pig) does the following:
@@ -108, +111 @@
  * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
+ [[Anchor(Tutorial_Join_Pig_Script)]]
  == Tutorial-Join Pig Script ==
  The tutorial-join pig script (tutorial-join.pig or tutorial-join-local.pig) does the following:
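
For readers who want to inspect the results file, here is a minimal Python sketch of parsing one output record into the documented fields (hour, ngram, score, count, mean). It assumes the records are tab-delimited, which is PigStorage's default; the `NgramRow` type, the `parse_ngram_line` helper, and the sample line are hypothetical and not part of the tutorial.

```python
from typing import NamedTuple


class NgramRow(NamedTuple):
    """One record of the tutorial output, in the documented field order."""
    hour: str
    ngram: str
    score: float
    count: int
    mean: float


def parse_ngram_line(line: str) -> NgramRow:
    # Assumption: fields are separated by tabs (the PigStorage default).
    hour, ngram, score, count, mean = line.rstrip("\n").split("\t")
    return NgramRow(hour, ngram, float(score), int(count), float(mean))


if __name__ == "__main__":
    # Hypothetical sample record, not taken from a real run.
    sample = "07\tnew york\t2.5\t12\t4.0\n"
    print(parse_ngram_line(sample))
```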
[Pig Wiki] Update of "PigTutorial" by CorinneC
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by CorinneC: http://wiki.apache.org/pig/PigTutorial -- - = Pig Tutorial = - (''page in progress ...'') The Pig tutorial shows you how to run Pig scripts in local mode or on a Hadoop cluster.
[Pig Wiki] Update of "PigTutorial" by CorinneC
  To install and run the Pig scripts in local mode, do the following:
  1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
- 1. Review the contents of the Pig tutorial file.
+ 1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -48, +48 @@
  To install and run the Pig scripts on a Hadoop cluster, do the following:
  1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
- 1. Review the contents of the Pig tutorial file.
+ 1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
  1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -63, +63 @@
  $ hadoop dfs -ls ngrams.txt
  }}}
+ [[Anchor(Pig_Tutorial_File)]]
  == Pig Tutorial File ==
  The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||
[Pig Wiki] Update of "PigTutorial" by CorinneC
  The tutorial pig script (tutorial.pig or tutorial-local.pig) does the following:
  * Registers the tutorial JAR file so that the user-defined functions (UDFs) can be called in the script.
- * Uses the !PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields '''user''', '''time''', and '''query'''.
+ * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields '''user''', '''time''', and '''query'''.
  * Calls the NonURLDetector UDF to remove records if the query field is empty or a URL.
  * Calls the !ToLower UDF to change the query field to lowercase.
  * Calls the !NonPornDetector UDF to remove records if the query field contains porn terms.
@@ -100, +100 @@
  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
  * Uses the [http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data DISTINCT] command to get the unique n-grams for all records.
  * Uses the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group records by n-gram and hour.
- * Uses the COUNT function to get the count (occurrences) of each n-gram.
+ * Uses the [http://wiki.apache.org/pig/PigBuiltins COUNT] function to get the count (occurrences) of each n-gram.
  * Uses the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group records by n-gram only.
  * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to remove all records with a score less than or equal to 2.0.
  * Uses the [http://wiki.apache.org/pig/PigLatin#ORDER:_Sorting_data_according_to_some_field ORDER] command to sort the remaining records by hour and score.
- * Uses the !PigStorage function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
+ * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
  == Tutorial-Join Pig Script ==
@@ -114, +114 @@
  The tutorial-join pig script (tutorial-join.pig or tutorial-join-local.pig) does the following:
  * Registers the tutorial JAR file so that the user-defined functions (UDFs) can be called in the script.
- * Uses the !PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields '''user''', '''time''', and '''query'''.
+ * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields '''user''', '''time''', and '''query'''.
  * Calls the NonURLDetector UDF to remove records if the query field is empty or a URL.
  * Calls the !ToLower UDF to change the query field to lowercase.
  * Calls the Non!PornDetector UDF to remove records if the query field contains porn terms.
@@ -122, +122 @@
  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
  * Uses the [http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data DISTINCT] command to get the unique n-grams for all records.
  * Uses the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group the records by n-gram and hour.
- * Uses the COUNT function to get the count (occurrences) of each n-gram.
+ * Uses the [http://wiki.apache.org/pig/PigBuiltins COUNT] function to get the count (occurrences) of each n-gram.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "00"
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "12"
  *
[Pig Wiki] Update of "PigTutorial" by CorinneC
  * Calls the !NonPornDetector UDF to remove records if the query field contains porn terms.
  * Calls the !ExtractHour UDF to extract the hour from the time field.
  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
- * Uses the DISTINCT command to get the unique n-grams for all records.
- * Uses the GROUP command to group records by n-gram and hour.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data DISTINCT] command to get the unique n-grams for all records.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group records by n-gram and hour.
  * Uses the COUNT function to get the count (occurrences) of each n-gram.
- * Uses the GROUP command to group records by n-gram only.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group records by n-gram only.
  * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to remove all records with a score less than or equal to 2.0.
- * Uses the ORDER command to sort the remaining records by hour and score.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#ORDER:_Sorting_data_according_to_some_field ORDER] command to sort the remaining records by hour and score.
  * Uses the !PigStorage function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
@@ -120, +120 @@
  * Calls the Non!PornDetector UDF to remove records if the query field contains porn terms.
  * Calls the !ExtractHour UDF to extract the hour from the time field.
  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
- * Uses the DISTINCT command to get the unique n-grams for all records.
- * Uses the GROUP command to group the records by n-gram and hour.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data DISTINCT] command to get the unique n-grams for all records.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group the records by n-gram and hour.
  * Uses the COUNT function to get the count (occurrences) of each n-gram.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "00"
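
To make the "compose the n-grams of the query" step concrete, here is a small Python sketch. It is an illustration only, not the tutorial's Java UDF: the function name `compose_ngrams`, the whitespace tokenization, and the default maximum n-gram size of 2 are all assumptions.

```python
from typing import List


def compose_ngrams(query: str, max_n: int = 2) -> List[str]:
    """Return all word n-grams of the query for n = 1..max_n.

    Assumptions: the query is lowercased first (mirroring the ToLower step)
    and words are split on whitespace.
    """
    words = query.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the query.
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams


if __name__ == "__main__":
    print(compose_ngrams("New York Hotels"))
    # prints ['new', 'york', 'hotels', 'new york', 'york hotels']
```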
[Pig Wiki] Update of "PigTutorial" by CorinneC
  * Uses the COUNT function to get the count (occurrences) of each n-gram.
  * Uses the GROUP command to group records by n-gram only.
  * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
- * Uses the GENERATE command to assign names to the fields.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to remove all records with a score less than or equal to 2.0.
  * Uses the ORDER command to sort the remaining records by hour and score.
  * Uses the !PigStorage function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
@@ -123, +123 @@
  * Uses the DISTINCT command to get the unique n-grams for all records.
  * Uses the GROUP command to group the records by n-gram and hour.
  * Uses the COUNT function to get the count (occurrences) of each n-gram.
- * Uses the GENERATE command to assign names to the fields.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "00"
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "12"
  * Uses the [http://wiki.apache.org/pig/PigLatin#Joining JOIN] command to join the n-grams in hour "00" and hour "12" by field $0
[Pig Wiki] Update of "PigTutorial" by CorinneC
  * Uses the GENERATE command to assign names to the fields.
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "00"
  * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "12"
- * Uses the JOIN command to join the n-grams in hour "00" and hour "12" by field $0
+ * Uses the [http://wiki.apache.org/pig/PigLatin#Joining JOIN] command to join the n-grams in hour "00" and hour "12" by field $0
  * Uses the COUNT function to get the count (occurrences) of the n-grams in both "00" and "12"
  * Uses the !PigStorage function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''count00''', '''count12'''
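
The filter-then-join steps above can be pictured with a short Python sketch. It assumes that field $0, the join key, holds the n-gram and that the input is a list of (key, hour, count) tuples; the function name `join_hours` and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple


def join_hours(
    counts: Iterable[Tuple[str, str, int]],
    hour_a: str = "00",
    hour_b: str = "12",
) -> List[Tuple[str, int, int]]:
    """Inner-join per-hour counts on the key (field $0).

    Each input tuple is (key, hour, count). The result keeps only keys that
    appear in both hours, as (key, count in hour_a, count in hour_b).
    """
    records = list(counts)
    # Two FILTER steps: one relation per hour of interest.
    in_a = {key: c for key, hour, c in records if hour == hour_a}
    in_b = {key: c for key, hour, c in records if hour == hour_b}
    # JOIN step: keep keys present on both sides.
    return [(key, in_a[key], in_b[key]) for key in sorted(in_a.keys() & in_b.keys())]


if __name__ == "__main__":
    # Hypothetical sample counts, not taken from excite.log.
    sample = [
        ("new york", "00", 3),
        ("new york", "12", 5),
        ("pig", "00", 1),
        ("hotel", "12", 2),
    ]
    print(join_hours(sample))
```

Only "new york" appears in both hours in this sample, so it is the only key that survives the join.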
[Pig Wiki] Update of "PigTutorial" by CorinneC
  * Uses the GROUP command to group records by n-gram only.
  * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
  * Uses the GENERATE command to assign names to the fields.
- * Uses the FILTER command to remove all records with a score less than or equal to 2.0.
+ * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to remove all records with a score less than or equal to 2.0.
  * Uses the ORDER command to sort the remaining records by hour and score.
  * Uses the !PigStorage function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
@@ -124, +124 @@
  * Uses the GROUP command to group the records by n-gram and hour.
  * Uses the COUNT function to get the count (occurrences) of each n-gram.
  * Uses the GENERATE command to assign names to the fields.
- * Uses the FILTER command to get the n-grams for hour "00"
- * Uses the FILTER command to get the n-grams for hour "12"
+ * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "00"
+ * Uses the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour "12"
  * Uses the JOIN command to join the n-grams in hour "00" and hour "12" by field $0
  * Uses the COUNT function to get the count (occurrences) of the n-grams in both "00" and "12"
  * Uses the !PigStorage function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''count00''', '''count12'''
[Pig Wiki] Update of "PigTutorial" by CorinneC
  1. Define an environment variable with the location of the Pig JAR file. For example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
- == Pig Scripts - Local Mode ==
+ == Pig Script Installation and Run - Local Mode ==
  To install and run the Pig scripts in local mode, do the following:
- 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial file is described below).
+ 1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
+ 1. Review the contents of the Pig tutorial file.
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
  }}}
- 1.#3 Review the results:
+ 1.#4 Review the results:
  {{{
  $ ls -l /tmp/ngrams.txt
  }}}
- == Pig Scripts - Hadoop Cluster ==
+ == Pig Script Installation and Run - Hadoop Cluster ==
  To install and run the Pig scripts on a Hadoop cluster, do the following:
- 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial file is described below).
+ 1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
+ 1. Review the contents of the Pig tutorial file.
  1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
  }}}
- 1.#3 Set the HADOOPSITEPATH environment variable to the location of your hadoop-site.xml file.
+ 1.#4 Set the HADOOPSITEPATH environment variable to the location of your hadoop-site.xml file.
  1. Execute the following command (using either tutorial.pig or tutorial-join.pig):
  {{{
  $ java -cp pig.jar:$HADOOPSITEPATH org.apache.pig.Main tutorial.pig
  }}}
- 1.#5 Review the results:
+ 1.#6 Review the results:
  {{{
  $ hadoop dfs -ls ngrams.txt
  }}}
@@ -65, +67 @@
  The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||
  || tutorial.jar|| User-defined functions (UDFs) ||
- || tutorial.pig || Tutorial-1 (run on Hadoop) ||
+ || tutorial.pig || Tutorial pig script (Hadoop) ||
- || tutorial-local.pig ||Tutorial-1 (run in local mode) ||
+ || tutorial-local.pig ||Tutorial pig script (local mode) ||
- || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
+ || tutorial-join.pig || Tutorial-join pig script (Hadoop) ||
- || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
+ || tutorial-join-local.pig || Tutorial-join pig script (local mode) ||
- || excite.log || Data file (for runs on Hadoop) ||
+ || excite.log || Data file (Hadoop) ||
- || excite-small.log || Data file (for runs in local mode) ||
+ || excite-small.log || Data file (local mode) ||
  || pornwords || Data file (porn keywords) ||
  The user-defined functions (UDFs) are described here.
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
- || N!GramGenerator || Extracts n-grams from the set of words. ||
+ || N!GramGenerator || Composes n-grams from the set of words. ||
  || !NonPornDetector|| Removes the record if the query field includes porn terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. ||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
@@ -85, +87 @@
  || !TutorialUtil || Divides the query string into a set of words.||
- == Pig Tutorial-1 ==
+ == Tutorial Pig Script ==
- The Pig Tutorial-1 script (tutorial.pig or tutorial-local.pig) does the following:
+ The tutorial pig script (tutorial.pig or tutorial-local.pig) does the following:
- * Registers the tutorial JAR file so that the included user-defined functions (UDFs) can be called in the script.
+ * Registers the tutorial JAR file so that the user-defined functions (UDFs) can be called in the script.
- * Loads the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.
+ * Uses the !PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields '''user''', '''time''', and '''query'''.
- * Calls the NonURLDetector UDF to remove any records where the query field is empty or is a URL.
+ * Calls the NonURLDetector UDF to remove records if the query field is empty or a URL.
- * Calls the !ToLower UDF to lower-case the query.
- * Calls the !NonPornDetector UDF to remove porno query terms (defined by a list in a text file).
+ * Calls the !ToLower UDF to change the query field to lowercase.
+ * Calls the !NonPornDetector UDF to remove records if the quer
[Pig Wiki] Update of "PigTutorial" by CorinneC
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
  || N!GramGenerator || Extracts n-grams from the set of words. ||
- || !NonPornDetector|| Removes porn terms from the query field. ||
+ || !NonPornDetector|| Removes the record if the query field includes porn terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. ||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
- || !ToLower || Switches the query field to lowercase. ||
+ || !ToLower || Changes the query field to lowercase. ||
  || !TutorialUtil || Divides the query string into a set of words.||
@@ -98, +98 @@
  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
  * Calls the DISTINCT operator to get the unique n-grams for all records.
  * Gets the count (occurrences) of each n-gram.
- * Calls the !ScoreGenerator UDF to calculate a popularity score for the n-gram.
+ * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
  * Removes all records with a score less than or equal to 2.0.
  * Sorts the remaining records by hour and score.
  * Saves the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
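
The count, score, filter, and sort steps listed above can be sketched in Python as follows. This is not the tutorial's !ScoreGenerator: the page only says the UDF computes a "popularity" score, so the formula used here (count minus the n-gram's mean hourly count, divided by the population standard deviation) is an assumption made for illustration, as are the function name `score_ngrams` and the ascending sort order.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev
from typing import Iterable, List, Tuple

Row = Tuple[str, str, float, int, float]  # (hour, ngram, score, count, mean)


def score_ngrams(pairs: Iterable[Tuple[str, str]], threshold: float = 2.0) -> List[Row]:
    """Count (hour, ngram) pairs, score each hour per n-gram, filter, and sort.

    Assumption: `pairs` is already de-duplicated as needed (the DISTINCT step).
    """
    counts = Counter(pairs)  # occurrences of each (hour, ngram)

    # Group the hourly counts by n-gram only.
    by_ngram = defaultdict(list)
    for (hour, ngram), count in counts.items():
        by_ngram[ngram].append((hour, count))

    rows: List[Row] = []
    for ngram, hourly in by_ngram.items():
        values = [count for _, count in hourly]
        avg = mean(values)
        spread = pstdev(values)
        for hour, count in hourly:
            # Assumed score: how far this hour's count sits above the mean.
            score = (count - avg) / spread if spread > 0 else 0.0
            if score > threshold:  # drop records with score <= threshold
                rows.append((hour, ngram, score, count, avg))

    # Sort the remaining records by hour, then by score.
    rows.sort(key=lambda r: (r[0], r[2]))
    return rows


if __name__ == "__main__":
    # Hypothetical data: "x" appears once in hours 00-04 and seven times in hour 05.
    sample = [(f"{h:02d}", "x") for h in range(5)] + [("05", "x")] * 7
    for row in score_ngrams(sample):
        print(row)
```

With this sample only the hour-05 record clears the 2.0 threshold, because the other hours sit below the mean.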
[Pig Wiki] Update of "PigTutorial" by CorinneC
  (''page in progress ...'')
-
- == Pig Tutorials ==
  The Pig tutorial shows you how to run Pig scripts in local mode or on a Hadoop cluster.
  * To run the scripts in local mode, no Hadoop or DFS installation is required. All files are installed and run from your local host and file system.
  * To run the scripts on a Hadoop cluster, you need access to a Hadoop cluster and DFS installation.
- The Pig JAR file (pig.jar) and the Pig tutorial file (*.gz) include everything you need to get started.
+ The Pig JAR file (pig.jar) and the Pig tutorial file (*.gz) include everything you need to get started. Follow these three basic steps:
+ 1. Install Java (if necessary).
+ 1. Install Pig.
+ 1. Install and run the Pig scripts (in local mode or on a Hadoop cluster).
+
- === Java Installation ===
+ == Java Installation ==
+ Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME environment variable to the root of your Java installation.
- Your run-time environment should include '''Java 1.5'''.
-
- Set the JAVA_HOME environment variable to the root of your Java installation.
- === Pig Installation ===
+ == Pig Installation ==
  To install Pig, do the following:
  1. Download the Pig JAR file (pig.jar) and move it to the appropriate directory. For example: /home/me/pig.
  1. Define an environment variable with the location of the Pig JAR file. For example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
- === Pig Scripts and Local Mode ===
+ == Pig Scripts - Local Mode ==
  To install and run the Pig scripts in local mode, do the following:
- 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial files are described below).
+ 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial file is described below).
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -41, +41 @@
  }}}
- === Pig Scripts and Hadoop Cluster ===
+ == Pig Scripts - Hadoop Cluster ==
  To install and run the Pig scripts on a Hadoop cluster, do the following:
- 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial files are described below).
+ 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial file is described below).
  1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -59, +59 @@
  $ hadoop dfs -ls ngrams.txt
  }}}
- === Pig Tutorial Files ===
+ == Pig Tutorial File ==
- The Pig tutorial files are described here.
+ The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||
  || tutorial.jar|| User-defined functions (UDFs) ||
  || tutorial.pig || Tutorial-1 (run on Hadoop) ||
[Pig Wiki] Update of "PigTutorial" by CorinneC
  * To run the scripts in local mode, no Hadoop or DFS installation is required. All files are installed and run from your local host and file system.
  * To run the scripts on a Hadoop cluster, you need access to a Hadoop cluster and DFS installation.
- The Pig JAR file (pig.jar) and the Pig tutorial file (*.tar.gz) include everything you need to get started.
+ The Pig JAR file (pig.jar) and the Pig tutorial file (*.gz) include everything you need to get started.
  === Java Installation ===
  Your run-time environment should include '''Java 1.5'''.
@@ -29, +29 @@
  === Pig Scripts and Local Mode ===
  To install and run the Pig scripts in local mode, do the following:
- 1. Download and unzip the Pig tutorial file (*.tar.gz) to your local directory (the Pig tutorial files are described below).
+ 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial files are described below).
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -44, +44 @@
  === Pig Scripts and Hadoop Cluster ===
  To install and run the Pig scripts on a Hadoop cluster, do the following:
- 1. Download and unzip the Pig tutorial file (*.tar.gz) to your local directory (the Pig tutorial files are described below).
+ 1. Download and unzip the Pig tutorial file (*.gz) to your local directory (the Pig tutorial files are described below).
  1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
[Pig Wiki] Update of "PigTutorial" by CorinneC
  * To run the scripts in local mode, no Hadoop or DFS installation is required. All files are installed and run from your local host and file system.
  * To run the scripts on a Hadoop cluster, you need access to a Hadoop cluster and DFS installation.
- The Pig JAR file (pig.jar) and the Pig tutorial file include everything you need to get started.
+ The Pig JAR file (pig.jar) and the Pig tutorial file (*.tar.gz) include everything you need to get started.
  === Java Installation ===
- Your run-time environment should include '''Java 1.5'''.
+ Your run-time environment should include '''Java 1.5'''.
+ Set the JAVA_HOME environment variable to the root of your Java installation.
  === Pig Installation ===
  To install Pig, do the following:
- 1. Download the Pig JAR file ('''pig.jar''') and move it to the appropriate directory. For example, /home/me/pig.
+ 1. Download the Pig JAR file (pig.jar) and move it to the appropriate directory. For example: /home/me/pig.
- 1. Define an environment variable with the location of the Pig JAR file. For example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
+ 1. Define an environment variable with the location of the Pig JAR file. For example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
  === Pig Scripts and Local Mode ===
  To install and run the Pig scripts in local mode, do the following:
- 1. Download and unzip the Pig tutorial file to your local directory (the Pig tutorial files are described below).
+ 1. Download and unzip the Pig tutorial file (*.tar.gz) to your local directory (the Pig tutorial files are described below).
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -43, +44 @@
  === Pig Scripts and Hadoop Cluster ===
  To install and run the Pig scripts on a Hadoop cluster, do the following:
- 1. Download and unzip the Pig tutorial file to your local directory (the Pig tutorial files are described below).
+ 1. Download and unzip the Pig tutorial file (*.tar.gz) to your local directory (the Pig tutorial files are described below).
  1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -84, +85 @@
  == Pig Tutorial-1 ==
- Pig Tutorial-1 (tutorial.pig or tutorial-local.pig) does the following:
+ The Pig Tutorial-1 script (tutorial.pig or tutorial-local.pig) does the following:
  * Registers the tutorial JAR file so that the included user-defined functions (UDFs) can be called in the script.
  * Loads the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.
@@ -102, +103 @@
  == Pig Tutorial-2 ==
- Pig Tutorial-2 (tutorial-join.pig or tutorial-join-local.pig) does the following:
+ The Pig Tutorial-2 script (tutorial-join.pig or tutorial-join-local.pig) does the following:
  * Registers the tutorial JAR file so that the included user-defined functions (UDFs) can be called in the script.
  * Loads the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.
[Pig Wiki] Update of "PigTutorial" by CorinneC
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by CorinneC: http://wiki.apache.org/pig/PigTutorial -- (''page in progress ...'') - == Running the Pig Tutorials == + == Pig Tutorials == - You can run the Pig tutorials in local mode or on a Hadoop cluster: + The Pig tutorial shows you how to run Pig scripts in local mode or on a Hadoop cluster. + - * To run Pig in local mode, no Hadoop or DFS installation is required. All files are installed and run from your local host and file system. + * To run the scripts in local mode, no Hadoop or DFS installation is required. All files are installed and run from your local host and file system. - * To run Pig on a Hadoop cluster, you need access to a Hadoop cluster and DFS installation. + * To run the scripts on a Hadoop cluster, you need access to a Hadoop cluster and DFS installation. - The Pig JAR file and the Pig tutorial file include everything you need to run in local mode or on a Hadoop cluster. + + The Pig JAR file (pig.jar) and the Pig tutorial file include everything you need to get started. === Java Installation === - Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME environment variable to the root of your Java installation. + Your run-time environment should include '''Java 1.5'''. + Set the JAVA_HOME environment variable to the root of your Java installation. === Pig Installation === To install Pig, do the following: - 1. Download the Pig JAR file (pig.jar) and move it to the appropriate directory. For example, /home/me/pig. + 1. Download the Pig JAR file ('''pig.jar''') and move it to the appropriate directory. For example, /home/me/pig. 1. Define an environment variable with the location of the Pig JAR file. For example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh). 
- === Running the Pig Tutorials in Local Mode ===
+ === Pig Scripts and Local Mode ===
- To run the Pig tutorial in local mode, do the following:
+ To install and run the Pig scripts in local mode, do the following:
  1. Download and unzip the Pig tutorial file to your local directory (the Pig tutorial files are described below).
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
@@ -37, +40 @@
  }}}
- === Running the Pig Tutorials on a Hadoop Cluster ===
+ === Pig Scripts and Hadoop Cluster ===
- To run the Pig tutorial on a Hadoop cluster, do the following:
+ To install and run the Pig scripts on a Hadoop cluster, do the following:
  1. Download and unzip the Pig tutorial file to your local directory (the Pig tutorial files are described below).
  1. Copy the exite.log file to your DFS directory.
@@ -78, +81 @@
  || !ToLower || Switches the query field to lowercase. ||
  || !TutorialUtil || Divides the query string into a set of words.||
+
+ == Pig Tutorial-1 ==
+
+ Pig Tutorial-1 (tutorial.pig or tutorial-local.pig) does the following:
+
+ * Registers the tutorial JAR file so that the included user-defined functions (UDFs) can be called in the script.
+ * Loads the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields user, time, and query.
+ * Calls the NonURLDetector UDF to remove any records where the query field is empty or is an URL.
+ * Calls the !ToLower UDF to lower-case the query.
+ * Calls the !NonPornDetector UDF to remove porno query terms (defined by a list in a text file).
+ * Calls the !ExtractHour UDF to extract the hour from the time of the record.
+ * Calls the N!GramGenerator UDF to compose the n-grams of the query.
+ * Calls the DISTINCT operator to get the unique n-grams for all records.
+ * Gets the count (occurrences) of each n-gram.
+ * Calls the !ScoreGenerator UDF to calculate a popularity score for the n-gram.
+ * Removes all records with a score less than or equal to 2.0.
+ * Sorts the remaining records by hour and score.
+ * Saves the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
+
+ == Pig Tutorial-2 ==
+
+ Pig Tutorial-2 (tutorial-join.pig or tutorial-join-local.pig) does the following:
+
+ * Registers the tutorial JAR file so that the included user-defined functions (UDFs) can be called in the script.
+ * Loads the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields user, time, and query.
+ * Calls the NonURLDetector UDF to remove any records where the query field is empty or is an URL.
+ * Calls the !ToLower UDF to lower-case the query.
+ * Calls the !NonPornDetector UDF to remove porno query terms (defined by a list in a text file).
+ * Calls the !Extrac
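The aggregation half of the Tutorial-1 steps listed in the diff above (DISTINCT, count, score, filter, sort) can be sketched in plain Python. This is only an illustration, not the tutorial's Pig script: the function name `score_ngrams` and the `(user, hour, ngram)` tuple layout are hypothetical, and the score formula (count minus mean, divided by the population standard deviation across hours) is an assumption about what !ScoreGenerator computes, since the page only says it calculates a "popularity" score.

```python
from collections import defaultdict
from statistics import mean, pstdev


def score_ngrams(records, min_score=2.0):
    """Return (hour, ngram, score, count, mean) rows, sorted by hour and score.

    records: iterable of (user, hour, ngram) tuples (hypothetical layout).
    """
    # DISTINCT: drop exact duplicate records.
    unique = set(records)

    # Count occurrences of each n-gram within each hour.
    counts = defaultdict(int)
    for _user, hour, ngram in unique:
        counts[(ngram, hour)] += 1

    # Regroup the hourly counts by n-gram.
    by_ngram = defaultdict(dict)
    for (ngram, hour), count in counts.items():
        by_ngram[ngram][hour] = count

    rows = []
    for ngram, hourly in by_ngram.items():
        values = list(hourly.values())
        avg = mean(values)
        spread = pstdev(values)  # assumption: population standard deviation
        if spread == 0:
            continue  # every hour looks the same, so no hour stands out
        for hour, count in hourly.items():
            score = (count - avg) / spread
            # Keep only rows with a score strictly greater than the threshold.
            if score > min_score:
                rows.append((hour, ngram, score, count, avg))

    # Sort the remaining rows by hour, then by score.
    rows.sort(key=lambda row: (row[0], row[2]))
    return rows
```

With six quiet hours and one busy hour for the same n-gram, only the busy hour survives the 2.0 threshold, which matches the intent of keeping n-grams that are unusually frequent at a particular time of day.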
[Pig Wiki] Update of "PigTutorial" by CorinneC
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial
--
  === Running the Pig Tutorials in Local Mode ===
  To run the Pig tutorial in local mode, do the following:
- 1. Download and unzip the Pig tutorial file to your local directory (all files are explained in the next section).
+ 1. Download and unzip the Pig tutorial file to your local directory (the Pig tutorial files are described below).
  1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -40, +40 @@
  === Running the Pig Tutorials on a Hadoop Cluster ===
  To run the Pig tutorial on a Hadoop cluster, do the following:
- 1. Download and unzip the Pig tutorial file to your local directory (all files are explained in the next section).
+ 1. Download and unzip the Pig tutorial file to your local directory (the Pig tutorial files are described below).
  1. Copy the exite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -56, +56 @@
  }}}
  === Pig Tutorial Files ===
- The Pig tutorail files are described here.
+ The Pig tutorial files are described here.
  || '''File''' || '''Description'''||
- || tutorial.jar|| JAR file containing the UDFs ||
+ || tutorial.jar|| User-defined functions (UDFs) ||
  || tutorial.pig || Tutorial-1 (run on Hadoop) ||
- || tutorial-local.pig ||Tutorail-1 (run in local mode) ||
+ || tutorial-local.pig ||Tutorail-1 (run in local mode) ||
  || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
  || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
- || exicte.log || Data file (for runs on Hadoop) ||
+ || excite.log || Data file (for runs on Hadoop) ||
  || excite-small.log || Data file (for runs in local mode) ||
  || pornwords || Data file (porn keywords) ||
+ The user-defined functions (UDFs) are described here.
+
+ || '''UDF''' || '''Description'''||
+ || !ExtractHour || Extracts the hour from the record.||
+ || N!GramGenerator || Extracts n-grams from the set of words. ||
+ || !NonPornDetector|| Removes porn terms from the query field. ||
+ || NonURLDetector || Removes the record if the query field is empty or a URL. ||
+ || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
+ || !ToLower || Switches the query field to lowercase. ||
+ || !TutorialUtil || Divides the query string into a set of words.||
+
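To make the per-record UDF descriptions in the table above more concrete, here is a rough Python sketch of what such helpers might look like. These are stand-ins, not the Java UDFs shipped in tutorial.jar. All function names are hypothetical, and three details are assumptions not stated on the page: the URL pattern, the timestamp layout (YYMMDDHHMMSS, so the hour sits at characters 6 to 8), and the limit of two words per n-gram. !NonPornDetector and !ScoreGenerator are not covered here.

```python
import re

# Assumption: a query is treated as a URL if it is a single token that starts
# with http://, https://, or www., or ends in a common top-level domain.
URL_PATTERN = re.compile(
    r"^(https?://|www\.)\S*$|^\S+\.(com|net|org|edu)(/\S*)?$",
    re.IGNORECASE,
)


def is_non_url(query):
    """Stand-in for NonURLDetector: False for empty or URL-like queries."""
    text = (query or "").strip()
    return bool(text) and URL_PATTERN.match(text) is None


def to_lower(query):
    """Stand-in for ToLower: switch the query to lowercase."""
    return query.lower()


def extract_hour(time):
    """Stand-in for ExtractHour, assuming a YYMMDDHHMMSS timestamp string."""
    return time[6:8]


def split_words(query):
    """Stand-in for TutorialUtil: divide the query into words on whitespace."""
    return query.split()


def ngrams(words, max_n=2):
    """Stand-in for NGramGenerator: all runs of 1..max_n consecutive words."""
    result = set()
    for size in range(1, max_n + 1):
        for start in range(len(words) - size + 1):
            result.add(" ".join(words[start:start + size]))
    return result
```

Chaining the helpers mirrors the order used in the tutorial script: filter with `is_non_url`, normalize with `to_lower`, then feed `split_words` into `ngrams`.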
[Pig Wiki] Update of "PigTutorial" by CorinneC
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial
--
  $ hadoop dfs -ls ngrams.txt
  }}}
+ === Pig Tutorial Files ===
+ The Pig tutorail files are described here.
+ || '''File''' || '''Description'''||
+ || tutorial.jar|| JAR file containing the UDFs ||
+ || tutorial.pig || Tutorial-1 (run on Hadoop) ||
+ || tutorial-local.pig ||Tutorail-1 (run in local mode) ||
+ || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
+ || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
+ || exicte.log || Data file (for runs on Hadoop) ||
+ || excite-small.log || Data file (for runs in local mode) ||
+ || pornwords || Data file (porn keywords) ||
+
[Pig Wiki] Update of "PigTutorial" by CorinneC
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial
--
  === Pig Installation ===
  To install Pig, do the following:
- 1. Download the Pig JAR file (pig.jar).
- 1. Move the file to the appropriate directory. For example, /home/me/pig.
+ 1. Download the Pig JAR file (pig.jar) and move it to the appropriate directory. For example, /home/me/pig.
  1. Define an environment variable with the location of the Pig JAR file. For example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
+
+ === Running the Pig Tutorials in Local Mode ===
+ To run the Pig tutorial in local mode, do the following:
+ 1. Download and unzip the Pig tutorial file to your local directory (all files are explained in the next section).
+ 1. Execute the following command (using either tutorial-local.pig or tutorial-join-local.pig)
+ {{{
+ $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
+ }}}
+
+ 1.#3 Review the results:
+ {{{
+ $ ls -l /tmp/ngrams.txt
+ }}}
+
[Pig Wiki] Update of "PigTutorial" by CorinneC
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial
--
  === Pig Installation ===
  To install Pig, do the following:
- 1. Download the Pig JAR file (pig.jar)
+ 1. Download the Pig JAR file (pig.jar).
+ 1. Move the file to the appropriate directory. For example, /home/me/pig.
  1. Define an environment variable with the location of the Pig JAR file. For example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
[Pig Wiki] Update of "PigTutorial" by CorinneC
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial
--
  The Pig JAR file and the Pig tutorial file include everything you need to run in local mode or on a Hadoop cluster.
  === Java Installation ===
+ Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME environment variable to the root of your Java installation.
- Your run-time environment should include the following:
- * Java 1.5 - preferably from Sun.
- * Set the JAVA_HOME environment variable to the root of your Java installation.
  === Pig Installation ===
- To install Pig, do the following:
+ To install Pig, do the following:
+ 1. Download the Pig JAR file (pig.jar)
  1. Define an environment variable with the location of the Pig JAR file. For example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).