[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  
   1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
   1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
+  1. Review the [#Tutorial_Pig_Script Tutorial Pig Script] and 
the [#Tutorial_Join_Pig_Script Tutorial-Join Pig Script].
-  1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
+  1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig).
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
  }}}
  
-  1.#4 Review the results:
+  1.#5 Review the results:
  {{{
  $ ls -l /tmp/ngrams.txt
  }}}
@@ -47, +48 @@

  
   1. Download and unzip the Pig tutorial file (*.gz) to your local directory. 
   1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
+  1. Review the [#Tutorial_Pig_Script Tutorial Pig Script] and 
the [#Tutorial_Join_Pig_Script Tutorial-Join Pig Script].
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
  }}}
-  1.#4 Set the HADOOPSITEPATH environment variable to the location of your 
hadoop-site.xml file.
+  1.#5 Set the HADOOPSITEPATH environment variable to the location of your 
hadoop-site.xml file.
   1. Execute the following command (using either tutorial.pig or 
tutorial-join.pig):
  {{{
  $ java -cp pig.jar:$HADOOPSITEPATH org.apache.pig.Main tutorial.pig
  }}}
-  1.#6 Review the results:
+  1.#7 Review the results:
  {{{
  $ hadoop dfs -ls ngrams.txt
  }}}
@@ -86, +88 @@

  || !TutorialUtil || Divides the query string into a set of words.||
  
  
+ [[Anchor(Tutorial_Pig_Script)]]
  == Tutorial Pig Script ==
  
  The tutorial pig script (tutorial.pig or tutorial-local.pig) does the 
following:
@@ -108, +111 @@

   * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to 
store the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean''' 
  
  
+ [[Anchor(Tutorial_Join_Pig_Script)]]
  == Tutorial-Join Pig Script ==
  
  The tutorial-join pig script (tutorial-join.pig or tutorial-join-local.pig) 
does the following:


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
- = Pig Tutorial =
- 
  (''page in progress ...'')
  
  The Pig tutorial shows you how to run Pig scripts in local mode or on a 
Hadoop cluster.


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  To install and run the Pig scripts in local mode, do the following:
  
   1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
-  1. Review the contents of the Pig tutorial file.
+  1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -48, +48 @@

  To install and run the Pig scripts on a Hadoop cluster, do the following:
  
   1. Download and unzip the Pig tutorial file (*.gz) to your local directory. 
-  1. Review the contents of the Pig tutorial file.
+  1. Review the contents of the [#Pig_Tutorial_File Pig Tutorial File].
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -63, +63 @@

  $ hadoop dfs -ls ngrams.txt
  }}}
  
+ [[Anchor(Pig_Tutorial_File)]]
  == Pig Tutorial File ==
  The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  The tutorial pig script (tutorial.pig or tutorial-local.pig) does the 
following:
  
   * Registers the tutorial JAR file so that the user-defined functions (UDFs) 
can be called in the script. 
-  * Uses the !PigStorage function to load the excite log file (excite.log or 
excite-small.log) into the “raw” bag as an array of records with the fields 
'''user''', '''time''', and '''query'''. 
+  * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to 
load the excite log file (excite.log or excite-small.log) into the “raw” 
bag as an array of records with the fields '''user''', '''time''', and 
'''query'''. 
   * Calls the NonURLDetector UDF to remove records if the query field is empty 
or a URL. 
   * Calls the !ToLower UDF to change the query field to lowercase.
   * Calls the !NonPornDetector UDF to remove records if the query field 
contains porn terms. 
@@ -100, +100 @@

   * Calls the N!GramGenerator UDF to compose the n-grams of the query. 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data 
DISTINCT] command to get the unique n-grams for all records. 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together
 GROUP] command to group records by n-gram and hour.
-  * Uses the COUNT function to get the count (occurrences) of each n-gram.
+  * Uses the [http://wiki.apache.org/pig/PigBuiltins COUNT] function to get 
the count (occurrences) of each n-gram.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together
 GROUP] command to group records by n-gram only.
   * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the 
n-gram.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data
 FOREACH-GENERATE] command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to remove all records with a score less than or equal to 2.0.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#ORDER:_Sorting_data_according_to_some_field
 ORDER] command to sort the remaining records by hour and score.
-  * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', '''ngram''', 
'''score''', '''count''', '''mean''' 
+  * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to 
store the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean''' 
  
  
  == Tutorial-Join Pig Script ==
@@ -114, +114 @@

  The tutorial-join pig script (tutorial-join.pig or tutorial-join-local.pig) 
does the following:
  
   * Registers the tutorial JAR file so that the user-defined functions (UDFs) 
can be called in the script. 
-  * Uses the !PigStorage function to load the excite log file (excite.log or 
excite-small.log) into the “raw” bag as an array of records with the fields 
'''user''', '''time''', and '''query'''. 
+  * Uses the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to 
load the excite log file (excite.log or excite-small.log) into the “raw” 
bag as an array of records with the fields '''user''', '''time''', and 
'''query'''. 
   * Calls the NonURLDetector UDF to remove records if the query field is empty 
or a URL. 
   * Calls the !ToLower UDF to change the query field to lowercase.
   * Calls the Non!PornDetector UDF to remove records if the query field 
contains porn terms. 
@@ -122, +122 @@

   * Calls the N!GramGenerator UDF to compose the n-grams of the query. 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data 
DISTINCT] command to get the unique n-grams for all records. 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together
 GROUP] command to group the records by n-gram and hour. 
-  * Uses the COUNT function to get the count (occurrences) of each n-gram. 
+  * Uses the [http://wiki.apache.org/pig/PigBuiltins COUNT] function to get 
the count (occurrences) of each n-gram. 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data
 FOREACH-GENERATE] command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘00’ 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘12’ 
   *

[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   * Calls the !NonPornDetector UDF to remove records if the query field 
contains porn terms. 
   * Calls the !ExtractHour UDF to extract the hour from the time field.
   * Calls the N!GramGenerator UDF to compose the n-grams of the query. 
-  * Uses the DISTINCT command to get the unique n-grams for all records. 
-  * Uses the GROUP command to group records by n-gram and hour.
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data 
DISTINCT] command to get the unique n-grams for all records. 
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together
 GROUP] command to group records by n-gram and hour.
   * Uses the COUNT function to get the count (occurrences) of each n-gram.
-  * Uses the GROUP command to group records by n-gram only.
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together
 GROUP] command to group records by n-gram only.
   * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the 
n-gram.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data
 FOREACH-GENERATE] command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to remove all records with a score less than or equal to 2.0.
-  * Uses the ORDER command to sort the remaining records by hour and score.
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#ORDER:_Sorting_data_according_to_some_field
 ORDER] command to sort the remaining records by hour and score.
   * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', '''ngram''', 
'''score''', '''count''', '''mean''' 
  
  
@@ -120, +120 @@

   * Calls the Non!PornDetector UDF to remove records if the query field 
contains porn terms. 
   * Calls the !ExtractHour UDF to extract the hour from the time field.
   * Calls the N!GramGenerator UDF to compose the n-grams of the query. 
-  * Uses the DISTINCT command to get the unique n-grams for all records. 
-  * Uses the GROUP command to group the records by n-gram and hour. 
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data 
DISTINCT] command to get the unique n-grams for all records. 
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together
 GROUP] command to group the records by n-gram and hour. 
   * Uses the COUNT function to get the count (occurrences) of each n-gram. 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data
 FOREACH-GENERATE] command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘00’ 


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   * Uses the COUNT function to get the count (occurrences) of each n-gram.
   * Uses the GROUP command to group records by n-gram only.
   * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the 
n-gram.
-  * Uses the GENERATE command to assign names to the fields.
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data
 FOREACH-GENERATE] command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to remove all records with a score less than or equal to 2.0.
   * Uses the ORDER command to sort the remaining records by hour and score.
   * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', '''ngram''', 
'''score''', '''count''', '''mean''' 
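
As a rough illustration of the FILTER and ORDER steps listed above, here is a minimal Python sketch. It is not Pig Latin and is not taken from tutorial.pig. It assumes each record is a tuple in the order of the output fields (hour, ngram, score, count, mean). The function name `filter_and_sort`, the ascending sort direction, and the sample rows are assumptions made for illustration only.

```python
def filter_and_sort(rows, threshold=2.0):
    """Drop rows whose score is <= threshold, then sort by hour and score.

    rows: iterable of (hour, ngram, score, count, mean) tuples (assumed layout).
    The ascending order for both keys is an assumption; the tutorial text
    only says the records are sorted by hour and score.
    """
    kept = [row for row in rows if row[2] > threshold]
    return sorted(kept, key=lambda row: (row[0], row[2]))


if __name__ == "__main__":
    # Hypothetical sample data, not real tutorial output.
    sample = [
        ("01", "b", 3.0, 4, 1.0),
        ("00", "a", 2.0, 2, 1.0),
        ("00", "c", 5.0, 6, 1.2),
        ("00", "d", 2.5, 3, 1.0),
    ]
    for row in filter_and_sort(sample):
        print(row)
```

The row with score 2.0 is dropped because the condition removes scores less than or equal to the threshold, not only scores strictly below it.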
@@ -123, +123 @@

   * Uses the DISTINCT command to get the unique n-grams for all records. 
   * Uses the GROUP command to group the records by n-gram and hour. 
   * Uses the COUNT function to get the count (occurrences) of each n-gram. 
-  * Uses the GENERATE command to assign names to the fields.
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data
 FOREACH-GENERATE] command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘00’ 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘12’ 
   * Uses the [http://wiki.apache.org/pig/PigLatin#Joining JOIN] command to 
join the n-grams in hour “00” and hour “12” by field $0


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   * Uses the GENERATE command to assign names to the fields.
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘00’ 
   * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘12’ 
-  * Uses the JOIN command to join the n-grams in hour “00” and  hour 
“12” by field $0
+  * Uses the [http://wiki.apache.org/pig/PigLatin#Joining JOIN] command to 
join the n-grams in hour “00” and hour “12” by field $0
   * Uses the COUNT function to get the count (occurrences) of the n-grams in 
both “00” and “12” 
   * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', 
'''count00''', '''count12'''
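
The two FILTER steps and the JOIN step described above can be pictured with the following Python sketch. It only illustrates the data flow and is not the contents of tutorial-join.pig. The input layout of (ngram, hour, count) tuples, the function name `join_hours`, and the shape of the returned tuples are all assumptions chosen for this example.

```python
def join_hours(records, hour_a="00", hour_b="12"):
    """Join n-gram counts for two hours on the n-gram field.

    records: iterable of (ngram, hour, count) tuples (assumed layout).
    Returns a sorted list of (ngram, count_a, count_b) for the n-grams
    that occur in both hours, which is what an inner join would keep.
    """
    # Equivalent of the two FILTER steps: one subset per hour.
    counts_a = {ngram: count for ngram, hour, count in records if hour == hour_a}
    counts_b = {ngram: count for ngram, hour, count in records if hour == hour_b}
    # Equivalent of the JOIN step: keep n-grams present in both subsets.
    shared = counts_a.keys() & counts_b.keys()
    return sorted((ngram, counts_a[ngram], counts_b[ngram]) for ngram in shared)


if __name__ == "__main__":
    # Hypothetical sample data, not real tutorial output.
    sample = [
        ("new york", "00", 3),
        ("new york", "12", 5),
        ("pig latin", "00", 1),
        ("weather", "12", 2),
    ]
    print(join_hours(sample))
```

Only "new york" appears in both hours in the sample, so it is the only n-gram the join keeps.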
  


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   * Uses the GROUP command to group records by n-gram only.
   * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the 
n-gram.
   * Uses the GENERATE command to assign names to the fields.
-  * Uses the FILTER command to remove all records with a score less than or 
equal to 2.0.
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to remove all records with a score less than or equal to 2.0.
   * Uses the ORDER command to sort the remaining records by hour and score.
   * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', '''ngram''', 
'''score''', '''count''', '''mean''' 
  
@@ -124, +124 @@

   * Uses the GROUP command to group the records by n-gram and hour. 
   * Uses the COUNT function to get the count (occurrences) of each n-gram. 
   * Uses the GENERATE command to assign names to the fields.
-  * Uses the FILTER command to get the n-grams for hour ‘00’ 
-  * Uses the FILTER command to get the n-grams for hour ‘12’ 
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘00’ 
+  * Uses the 
[http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
 FILTER] command to get the n-grams for hour ‘12’ 
   * Uses the JOIN command to join the n-grams in hour “00” and  hour 
“12” by field $0
   * Uses the COUNT function to get the count (occurrences) of the n-grams in 
both “00” and “12” 
   * Uses the !PigStorage function to store the results. The output file 
contains a list of n-grams with the following fields: '''hour''', 
'''count00''', '''count12'''


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   1. Define an environment variable with the location of the Pig JAR file. For 
example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
  
- == Pig Scripts - Local Mode ==
+ == Pig Script Installation and Run - Local Mode ==
  To install and run the Pig scripts in local mode, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial file is described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory.
+  1. Review the contents of the Pig tutorial file.
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
  }}}
  
-  1.#3 Review the results:
+  1.#4 Review the results:
  {{{
  $ ls -l /tmp/ngrams.txt
  }}}
  
  
- == Pig Scripts - Hadoop Cluster ==
+ == Pig Script Installation and Run - Hadoop Cluster ==
  To install and run the Pig scripts on a Hadoop cluster, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial file is described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory. 
+  1. Review the contents of the Pig tutorial file.
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
  }}}
-  1.#3 Set the HADOOPSITEPATH environment variable to the location of your 
hadoop-site.xml file.
+  1.#4 Set the HADOOPSITEPATH environment variable to the location of your 
hadoop-site.xml file.
   1. Execute the following command (using either tutorial.pig or 
tutorial-join.pig):
  {{{
  $ java -cp pig.jar:$HADOOPSITEPATH org.apache.pig.Main tutorial.pig
  }}}
-  1.#5 Review the results:
+  1.#6 Review the results:
  {{{
  $ hadoop dfs -ls ngrams.txt
  }}}
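
To make the difference between the two run commands above easier to see, here is a small Python sketch that assembles each command line from the PIGDIR and HADOOPSITEPATH environment variables. The helper name `pig_command` and the example directory values are hypothetical. The sketch only builds the argument list and does not run Java.

```python
import os

MAIN_CLASS = "org.apache.pig.Main"


def pig_command(script, local=True, env=None):
    """Return the argument list for running a tutorial script.

    local=True mirrors the local-mode command (uses PIGDIR and "-x local").
    local=False mirrors the Hadoop command (adds HADOOPSITEPATH to the classpath).
    env defaults to the process environment; a dict can be passed for testing.
    """
    env = os.environ if env is None else env
    if local:
        classpath = env["PIGDIR"] + "/pig.jar"
        return ["java", "-cp", classpath, MAIN_CLASS, "-x", "local", script]
    classpath = "pig.jar:" + env["HADOOPSITEPATH"]
    return ["java", "-cp", classpath, MAIN_CLASS, script]


if __name__ == "__main__":
    # Hypothetical locations, used only for this demonstration.
    demo_env = {"PIGDIR": "/home/me/pig", "HADOOPSITEPATH": "/home/me/hadoop/conf"}
    print(" ".join(pig_command("tutorial-local.pig", env=demo_env)))
    print(" ".join(pig_command("tutorial.pig", local=False, env=demo_env)))
```

A missing variable raises `KeyError`, which surfaces a forgotten `export` before Java is ever started.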
@@ -65, +67 @@

  The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||
  || tutorial.jar|| User-defined functions (UDFs) ||
- || tutorial.pig || Tutorial-1 (run on Hadoop) ||
+ || tutorial.pig || Tutorial pig script (Hadoop) ||
- || tutorial-local.pig ||Tutorial-1 (run in local mode) ||
+ || tutorial-local.pig || Tutorial pig script (local mode) ||
- || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
+ || tutorial-join.pig || Tutorial-join pig script (Hadoop) ||
- || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
+ || tutorial-join-local.pig || Tutorial-join pig script (local mode) ||
- || excite.log || Data file (for runs on Hadoop) ||
+ || excite.log || Data file (Hadoop) ||
- || excite-small.log || Data file (for runs in local mode) ||
+ || excite-small.log || Data file (local mode) ||
  || pornwords || Data file (porn keywords) ||
  
  The user-defined functions (UDFs) are described here.
  
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
- || N!GramGenerator || Extracts n-grams from the set of words. ||
+ || N!GramGenerator || Composes n-grams from the set of words. ||
  || !NonPornDetector|| Removes the record if the query field includes porn 
terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. 
||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
@@ -85, +87 @@

  || !TutorialUtil || Divides the query string into a set of words.||
  
  
- == Pig Tutorial-1 ==
+ == Tutorial Pig Script ==
  
- The Pig Tutorial-1 script (tutorial.pig or tutorial-local.pig) does the 
following:
+ The tutorial pig script (tutorial.pig or tutorial-local.pig) does the 
following:
  
-  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
+  * Registers the tutorial JAR file so that the user-defined functions (UDFs) 
can be called in the script. 
-  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
+  * Uses the !PigStorage function to load the excite log file (excite.log or 
excite-small.log) into the “raw” bag as an array of records with the fields 
'''user''', '''time''', and '''query'''. 
-  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is an URL.
+  * Calls the NonURLDetector UDF to remove records if the query field is empty 
or a URL. 
-  * Calls the !ToLower UDF to lower-case the query.
-  * Calls the !NonPornDetector UDF to remove porno query terms (defined by a 
list in a text file).
+  * Calls the !ToLower UDF to change the query field to lowercase.
+  * Calls the !NonPornDetector UDF to remove records if the quer

[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-22 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
  || N!GramGenerator || Extracts n-grams from the set of words. ||
- || !NonPornDetector|| Removes porn terms from the query field. ||
+ || !NonPornDetector|| Removes the record if the query field includes porn 
terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. 
||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
- || !ToLower || Switches the query field to lowercase. ||
+ || !ToLower || Changes the query field to lowercase. ||
  || !TutorialUtil || Divides the query string into a set of words.||
  
  
@@ -98, +98 @@

   * Calls the N!GramGenerator UDF to compose the n-grams of the query.
   * Calls the DISTINCT operator to get the unique n-grams for all records.
   * Gets the count (occurrences) of each n-gram.
-  * Calls the !ScoreGenerator UDF to calculate a popularity score for the 
n-gram.
+  * Calls the !ScoreGenerator UDF to calculate a "popularity" score for the 
n-gram.
   * Removes all records with a score less than or equal to 2.0.
   * Sorts the remaining records by hour and score.
   * Saves the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
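
The n-gram, DISTINCT, and count steps above can be sketched in Python as follows. This is an illustration of the idea, not the code of the N!GramGenerator UDF or of tutorial.pig. The maximum n-gram length of 2, the whitespace tokenization, the (user, hour, query) record layout, and the function names are assumptions. The "popularity" score is left out because its formula is not described here.

```python
from collections import Counter


def compose_ngrams(words, max_n=2):
    """Return every run of 1..max_n consecutive words as a space-joined string."""
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def count_ngrams_by_hour(records, max_n=2):
    """Count n-gram occurrences per (ngram, hour).

    records: iterable of (user, hour, query) tuples (assumed layout).
    Duplicate (user, hour, ngram) combinations are collapsed first, which is
    this sketch's reading of the DISTINCT step.
    """
    unique = {
        (user, hour, ngram)
        for user, hour, query in records
        for ngram in compose_ngrams(query.lower().split(), max_n)
    }
    return Counter((ngram, hour) for _user, hour, ngram in unique)


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = [("u1", "00", "Pig Latin"), ("u1", "00", "pig latin"), ("u2", "00", "pig")]
    for key, count in sorted(count_ngrams_by_hour(sample).items()):
        print(key, count)
```

Lowercasing before composing the n-grams makes "Pig Latin" and "pig latin" collapse into one entry for the same user and hour.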


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  
  (''page in progress ...'')
  
- 
- == Pig Tutorials ==
  The Pig tutorial shows you how to run Pig scripts in local mode or on a 
Hadoop cluster.
  
   * To run the scripts in local mode, no Hadoop or DFS installation is 
required. All files are installed and run from your local host and file system.
   * To run the scripts on a Hadoop cluster, you need access to a Hadoop 
cluster and DFS installation.
  
- The Pig JAR file (pig.jar) and the Pig tutorial file (*.gz) include 
everything you need to get started.
+ The Pig JAR file (pig.jar) and the Pig tutorial file (*.gz) include 
everything you need to get started. Follow these three basic steps:
  
+  1. Install Java (if necessary).
+  1. Install Pig.
+  1. Install and run the Pig scripts (in local mode or on a Hadoop cluster).
+ 
- === Java Installation ===
+ == Java Installation ==
+ Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME 
environment variable to the root of your Java installation. 
- Your run-time environment should include '''Java 1.5'''.
-  
- Set the JAVA_HOME environment variable to the root of your Java installation. 
  
  
- === Pig Installation ===
+ == Pig Installation ==
  To install Pig, do the following:
  
   1. Download the Pig JAR file (pig.jar) and move it to the appropriate 
directory. For example:  /home/me/pig. 
   1. Define an environment variable with the location of the Pig JAR file. For 
example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
  
- === Pig Scripts and Local Mode ===
+ == Pig Scripts - Local Mode ==
  To install and run the Pig scripts in local mode, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial files are described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial file is described below).
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -41, +41 @@

  }}}
  
  
- === Pig Scripts and Hadoop Cluster ===
+ == Pig Scripts - Hadoop Cluster ==
  To install and run the Pig scripts on a Hadoop cluster, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial files are described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial file is described below).
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -59, +59 @@

  $ hadoop dfs -ls ngrams.txt
  }}}
  
- === Pig Tutorial Files ===
+ == Pig Tutorial File ==
- The Pig tutorial files are described here.
+ The contents of the Pig tutorial file (*.gz) are described here.
  || '''File''' || '''Description'''||
  || tutorial.jar|| User-defined functions (UDFs) ||
  || tutorial.pig || Tutorial-1 (run on Hadoop) ||


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   * To run the scripts in local mode, no Hadoop or DFS installation is 
required. All files are installed and run from your local host and file system.
   * To run the scripts on a Hadoop cluster, you need access to a Hadoop 
cluster and DFS installation.
  
- The Pig JAR file (pig.jar) and the Pig tutorial file (*.tar.gz) include 
everything you need to get started.
+ The Pig JAR file (pig.jar) and the Pig tutorial file (*.gz) include 
everything you need to get started.
  
  === Java Installation ===
  Your run-time environment should include '''Java 1.5'''.
@@ -29, +29 @@

  === Pig Scripts and Local Mode ===
  To install and run the Pig scripts in local mode, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.tar.gz) to your local 
directory (the Pig tutorial files are described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial files are described below).
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -44, +44 @@

  === Pig Scripts and Hadoop Cluster ===
  To install and run the Pig scripts on a Hadoop cluster, do the following:
  
-  1. Download and unzip the Pig tutorial file (*.tar.gz) to your local 
directory (the Pig tutorial files are described below).
+  1. Download and unzip the Pig tutorial file (*.gz) to your local directory 
(the Pig tutorial files are described below).
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
   * To run the scripts in local mode, no Hadoop or DFS installation is 
required. All files are installed and run from your local host and file system.
   * To run the scripts on a Hadoop cluster, you need access to a Hadoop 
cluster and DFS installation.
  
- The Pig JAR file (pig.jar) and the Pig tutorial file include everything you 
need to get started.
+ The Pig JAR file (pig.jar) and the Pig tutorial file (*.tar.gz) include 
everything you need to get started.
  
  === Java Installation ===
- Your run-time environment should include '''Java 1.5'''. 
+ Your run-time environment should include '''Java 1.5'''.
+  
  Set the JAVA_HOME environment variable to the root of your Java installation. 
  
  
  === Pig Installation ===
  To install Pig, do the following:
  
-  1. Download the Pig JAR file ('''pig.jar''') and move it to the appropriate 
directory. For example, /home/me/pig. 
+  1. Download the Pig JAR file (pig.jar) and move it to the appropriate 
directory. For example:  /home/me/pig. 
-  1. Define an environment variable with the location of the Pig JAR file. For 
example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
+  1. Define an environment variable with the location of the Pig JAR file. For 
example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
  
  === Pig Scripts and Local Mode ===
  To install and run the Pig scripts in local mode, do the following:
  
-  1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
+  1. Download and unzip the Pig tutorial file (*.tar.gz) to your local 
directory (the Pig tutorial files are described below).
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -43, +44 @@

  === Pig Scripts and Hadoop Cluster ===
  To install and run the Pig scripts on a Hadoop cluster, do the following:
  
-  1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
+  1. Download and unzip the Pig tutorial file (*.tar.gz) to your local 
directory (the Pig tutorial files are described below).
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -84, +85 @@

  
  == Pig Tutorial-1 ==
  
- Pig Tutorial-1 (tutorial.pig or tutorial-local.pig) does the following:
+ The Pig Tutorial-1 script (tutorial.pig or tutorial-local.pig) does the 
following:
  
   * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
   * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
@@ -102, +103 @@

  
  == Pig Tutorial-2 ==
  
- Pig Tutorial-2 (tutorial-join.pig or tutorial-join-local.pig) does the 
following:
+ The Pig Tutorial-2 script (tutorial-join.pig or tutorial-join-local.pig) does 
the following:
  
   * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
   * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  (''page in progress ...'')
  
  
- == Running the Pig Tutorials ==
+ == Pig Tutorials ==
- You can run the Pig tutorials in local mode or on a Hadoop cluster:
+ The Pig tutorial shows you how to run Pig scripts in local mode or on a 
Hadoop cluster.
+ 
-  * To run Pig in local mode, no Hadoop or DFS installation is required. All 
files are installed and run from your local host and file system.
+  * To run the scripts in local mode, no Hadoop or DFS installation is 
required. All files are installed and run from your local host and file system.
-  * To run Pig on a Hadoop cluster, you need access to a Hadoop cluster and 
DFS installation.
+  * To run the scripts on a Hadoop cluster, you need access to a Hadoop 
cluster and DFS installation.
- The Pig JAR file and the Pig tutorial file include everything you need to run 
in local mode or on a Hadoop cluster.
+ 
+ The Pig JAR file (pig.jar) and the Pig tutorial file include everything you 
need to get started.
  
  === Java Installation ===
- Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME 
environment variable to the root of your Java installation. 
+ Your run-time environment should include '''Java 1.5'''. 
+ Set the JAVA_HOME environment variable to the root of your Java installation. 
  
  
  === Pig Installation ===
  To install Pig, do the following:
  
-  1. Download the Pig JAR file (pig.jar) and move it to the appropriate 
directory. For example, /home/me/pig. 
+  1. Download the Pig JAR file ('''pig.jar''') and move it to the appropriate 
directory. For example, /home/me/pig. 
   1. Define an environment variable with the location of the Pig JAR file. For 
example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
  
- === Running the Pig Tutorials in Local Mode ===
+ === Pig Scripts and Local Mode ===
- To run the Pig tutorial in local mode, do the following:
+ To install and run the Pig scripts in local mode, do the following:
  
   1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
@@ -37, +40 @@

  }}}
  
  
- === Running the Pig Tutorials on a Hadoop Cluster ===
+ === Pig Scripts and Hadoop Cluster ===
- To run the Pig tutorial on a Hadoop cluster, do the following:
+ To install and run the Pig scripts on a Hadoop cluster, do the following:
  
   1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
   1. Copy the excite.log file to your DFS directory.
@@ -78, +81 @@

  || !ToLower || Switches the query field to lowercase. ||
  || !TutorialUtil || Divides the query string into a set of words.||
  
+ 
+ == Pig Tutorial-1 ==
+ 
+ Pig Tutorial-1 (tutorial.pig or tutorial-local.pig) does the following:
+ 
+  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
+  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
+  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is a URL.
+  * Calls the !ToLower UDF to lower-case the query.
+  * Calls the !NonPornDetector UDF to remove porno query terms (defined by a 
list in a text file).
+  * Calls the !ExtractHour UDF to extract the hour from the time of the record.
+  * Calls the N!GramGenerator UDF to compose the n-grams of the query.
+  * Calls the DISTINCT operator to get the unique n-grams for all records.
+  * Gets the count (occurrences) of each n-gram.
+  * Calls the !ScoreGenerator UDF to calculate a popularity score for the 
n-gram.
+  * Removes all records with a score less than or equal to 2.0.
+  * Sorts the remaining records by hour and score.
+  * Saves the results. The output file contains a list of n-grams with the 
following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''
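
The first cleaning steps in the list above (drop empty or URL queries, lower-case the query, extract the hour) can be approximated outside Pig for a quick look at the data. This is a rough sketch, not the tutorial's UDF code: `clean_queries` is a hypothetical name, the URL pattern is a guess at what counts as a URL, and the hour extraction assumes the time field has the form YYMMDDHHMMSS in a tab-separated record of user, time, and query.

```shell
# Hypothetical stand-in for the first cleaning steps; the real work is
# done by the UDFs in tutorial.jar, whose exact rules may differ.
# Reads tab-separated records (user, time, query) on standard input.
clean_queries() {
  awk -F '\t' '
    # Drop records whose query field is empty.
    $3 == "" { next }
    # Drop records whose query looks like a URL (assumed pattern).
    $3 ~ /^(https?:\/\/|www\.)/ { next }
    # Assumes time is YYMMDDHHMMSS, so the hour is characters 7-8.
    { printf "%s\t%s\t%s\n", $1, substr($2, 7, 2), tolower($3) }
  '
}
```

For example, `clean_queries < tutorial/excite-small.log | head` shows user, hour, and lower-cased query for the first surviving records, assuming the data file matches the format above.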
+ 
+ == Pig Tutorial-2 ==
+ 
+ Pig Tutorial-2 (tutorial-join.pig or tutorial-join-local.pig) does the 
following:
+ 
+  * Registers the tutorial JAR file so that the included user-defined 
functions (UDFs) can be called in the script.
+  * Loads the excite log file (excite.log or excite-small.log) into the 
“raw” bag as an array of records with the fields user, time, and query.
+  * Calls the NonURLDetector UDF to remove any records where the query field 
is empty or is a URL.
+  * Calls the !ToLower UDF to lower-case the query.
+  * Calls the !NonPornDetector UDF to remove porno query terms (defined by a 
list in a text file).
+  * Calls the !Extrac

[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  === Running the Pig Tutorials in Local Mode ===
  To run the Pig tutorial in local mode, do the following:
  
-  1. Download and unzip the Pig tutorial file to your local directory (all 
files are explained in the next section).
+  1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
   1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
  {{{
  $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
@@ -40, +40 @@

  === Running the Pig Tutorials on a Hadoop Cluster ===
  To run the Pig tutorial on a Hadoop cluster, do the following:
  
-  1. Download and unzip the Pig tutorial file to your local directory (all 
files are explained in the next section).
+  1. Download and unzip the Pig tutorial file to your local directory (the Pig 
tutorial files are described below).
   1. Copy the excite.log file to your DFS directory.
  {{{
  $ hadoop dfs -copyFromLocal tutorial/excite.log .
@@ -56, +56 @@

  }}}
  
  === Pig Tutorial Files ===
- The Pig tutorail files are described here.
+ The Pig tutorial files are described here.
  || '''File''' || '''Description'''||
- || tutorial.jar|| JAR file containing the UDFs  ||
+ || tutorial.jar|| User-defined functions (UDFs) ||
  || tutorial.pig || Tutorial-1 (run on Hadoop) ||
- || tutorial-local.pig ||Tutorail-1 (run in local mode)  ||
+ || tutorial-local.pig ||Tutorial-1 (run in local mode) ||
  || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
  || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
- || exicte.log || Data file (for runs on Hadoop) ||
+ || excite.log || Data file (for runs on Hadoop) ||
  || excite-small.log || Data file (for runs in local mode) ||
  || pornwords || Data file (porn keywords) ||
  
+ The user-defined functions (UDFs) are described here.
+ 
+ || '''UDF''' || '''Description'''||
+ || !ExtractHour || Extracts the hour from the record.||
+ || N!GramGenerator || Extracts n-grams from the set of words. ||
+ || !NonPornDetector|| Removes porn terms from the query field. ||
+ || NonURLDetector || Removes the record if the query field is empty or a URL. 
||
+ || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
+ || !ToLower || Switches the query field to lowercase. ||
+ || !TutorialUtil || Divides the query string into a set of words.||
+ 


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  $ hadoop dfs -ls ngrams.txt
  }}}
  
+ === Pig Tutorial Files ===
+ The Pig tutorail files are described here.
+ || '''File''' || '''Description'''||
+ || tutorial.jar|| JAR file containing the UDFs  ||
+ || tutorial.pig || Tutorial-1 (run on Hadoop) ||
+ || tutorial-local.pig ||Tutorail-1 (run in local mode)  ||
+ || tutorial-join.pig || Tutorial-2 (run on Hadoop) ||
+ || tutorial-join-local.pig || Tutorial-2 (run in local mode) ||
+ || exicte.log || Data file (for runs on Hadoop) ||
+ || excite-small.log || Data file (for runs in local mode) ||
+ || pornwords || Data file (porn keywords) ||
+ 


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  === Pig Installation ===
  To install Pig, do the following:
  
-  1. Download the Pig JAR file (pig.jar).
-  1. Move the file to the appropriate directory. For example, /home/me/pig. 
+  1. Download the Pig JAR file (pig.jar) and move it to the appropriate 
directory. For example, /home/me/pig. 
   1. Define an environment variable with the location of the Pig JAR file. For 
example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  
+ 
+ === Running the Pig Tutorials in Local Mode ===
+ To run the Pig tutorial in local mode, do the following:
+  1. Download and unzip the Pig tutorial file to your local directory (all 
files are explained in the next section).
+  1. Execute the following command (using either tutorial-local.pig or 
tutorial-join-local.pig)
+ {{{
+ $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local tutorial-local.pig
+ }}}
+ 
+  1.#3 Review the results:
+ {{{
+ $ ls -l /tmp/ngrams.txt
+ }}}
+ 


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  === Pig Installation ===
  To install Pig, do the following:
  
-  1. Download the Pig JAR file (pig.jar)
+  1. Download the Pig JAR file (pig.jar).
+  1. Move the file to the appropriate directory. For example, /home/me/pig. 
   1. Define an environment variable with the location of the Pig JAR file. For 
example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).
  


[Pig Wiki] Update of "PigTutorial" by CorinneC

2008-05-21 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

--
  The Pig JAR file and the Pig tutorial file include everything you need to run 
in local mode or on a Hadoop cluster.
  
  === Java Installation ===
+ Your run-time environment should include '''Java 1.5'''. Set the JAVA_HOME 
environment variable to the root of your Java installation. 
- Your run-time environment should include the following:
-  * Java 1.5 - preferably from Sun. 
-  * Set the JAVA_HOME environment variable to the root of your Java 
installation. 
  
  
  === Pig Installation ===
- To install Pig, do the following: 
+ To install Pig, do the following:
+ 
   1. Download the Pig JAR file (pig.jar)
   1. Define an environment variable with the location of the Pig JAR file. For 
example, export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig 
(tcsh, csh).