[Pig Wiki] Update of EmbeddedPig by OlgaN

2007-11-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/EmbeddedPig

New page:
== Embedding Pig In Java Programs == 

Sometimes you want more control than Pig scripts can give you. If so, you can 
embed Pig Latin in Java (just like SQL can be embedded in programs using JDBC). 

The following steps need to be carried out:

 * Make sure `pig.jar` is on your classpath.
 * Create an instance of `PigServer`
 * Issue commands through that PigServer by calling 
`PigServer.registerQuery()`.  
 * To retrieve results, either call `PigServer.openIterator()` or 
`PigServer.store()`.
 * If you have user defined functions, register them by calling 
`PigServer.registerJar()`.

=== Example ===

Let's assume that you need to count the number of occurrences of each word in a 
document. Let's also assume that you have an EvalFunction, `Tokenize`, that parses 
a line of text and returns all the words in that line. The function is located 
in `/mylocation/tokenize.jar`.

The Pig Latin script for this computation looks as follows:

{{{
register /mylocation/tokenize.jar
A = load 'mytext' using TextLoader();
B = foreach A generate flatten(tokenize($0));
C = group B by $1;
D = foreach C generate flatten(group), COUNT(B.$0);
store D into 'myoutput';
}}}
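
To make the dataflow of the script concrete, here is a minimal plain-Java sketch of the same computation done in memory, without Pig. It is only an illustration: the class `WordCountSketch` is hypothetical, and its `tokenize` method assumes that words are separated by whitespace, which the real `Tokenize` function may or may not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Stand-in for the Tokenize function. Assumption: words are whitespace-separated.
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Mirrors the script: flatten the words of every line (B),
    // group equal words together (C), and count each group (D).
    public static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog");
        // Print one "word<TAB>count" pair per line.
        countWords(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Pig performs the same grouping and counting, but over data stored on DFS rather than over an in-memory list.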

The same can be accomplished with the following Java program:

{{{

import java.io.IOException;
import org.apache.pig.PigServer;

public class WordCount {
    public static void main(String[] args) {
        PigServer pigServer = new PigServer();

        try {
            // Register the jar that contains the Tokenize function.
            pigServer.registerJar("/mylocation/tokenize.jar");
            runMyQuery(pigServer, "myinput.txt");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException {
        // Each statement is passed to Pig as an ordinary Java string.
        pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();");
        pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));");
        pigServer.registerQuery("C = group B by $1;");
        pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);");

        // Write the relation D to the file 'myoutput'.
        pigServer.store("D", "myoutput");
    }
}

}}}
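
Each statement passed to `registerQuery()` is an ordinary Java string, so the input path has to end up inside single quotes within that string. The sketch below shows one way to assemble such a load statement. The class `PigStatements` and its `loadStatement` method are hypothetical helpers, not part of Pig, and the sketch refuses paths that contain a single quote rather than guessing at an escaping rule.

```java
public class PigStatements {
    // Hypothetical helper: builds "<alias> = load '<path>' using TextLoader();".
    public static String loadStatement(String alias, String path) {
        // A quote inside the path would end the Pig Latin string literal early.
        if (path.indexOf('\'') >= 0) {
            throw new IllegalArgumentException("path must not contain a single quote: " + path);
        }
        return alias + " = load '" + path + "' using TextLoader();";
    }

    public static void main(String[] args) {
        // Prints: A = load 'myinput.txt' using TextLoader();
        System.out.println(loadStatement("A", "myinput.txt"));
    }
}
```

The resulting string could then be handed to `pigServer.registerQuery()` in place of the inline concatenation.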

Notes:

 * The jar that contains your functions must be registered.
 * The four calls to `pigServer.registerQuery()` simply cause the query to be 
parsed and enqueued. The query is not actually executed until 
`pigServer.store()` is called.
 * The input data referred to in the load statement must be on DFS in the 
specified location.
 * The final result is placed into the `myoutput` file in your current working 
directory on DFS. (By default this is your home directory on DFS.)
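
The deferred execution described in the notes can be pictured with a toy model. The class below is hypothetical and is not how Pig is implemented: it only records statements when `registerQuery()` is called and treats them as run once `store()` is called.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredQueries {
    private final List<String> pending = new ArrayList<>();
    private final List<String> executed = new ArrayList<>();

    // Toy version of registerQuery(): remember the statement, run nothing yet.
    public void registerQuery(String statement) {
        pending.add(statement);
    }

    // Toy version of store(): only now is the queued work carried out.
    public void store(String alias, String output) {
        executed.addAll(pending);
        pending.clear();
        executed.add("store " + alias + " into '" + output + "';");
    }

    public List<String> executed() {
        return executed;
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        DeferredQueries q = new DeferredQueries();
        q.registerQuery("A = load 'myinput.txt' using TextLoader();");
        q.registerQuery("B = foreach A generate flatten(tokenize($0));");
        System.out.println("executed before store: " + q.executed().size()); // 0
        q.store("B", "myoutput");
        System.out.println("executed after store: " + q.executed().size());  // 3
    }
}
```

A practical consequence is that a problem with the input data may surface only at the `store()` call, not at the `registerQuery()` call that introduced the statement.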

To run your program, you need to first compile it by using the following 
command:

{{{
javac -cp <path>/pig.jar WordCount.java
}}}

If the compilation is successful, you can then run your program. The current 
directory is included on the classpath so that `WordCount.class` can be found:

{{{
java -cp <path>/pig.jar:. WordCount
}}}


[Pig Wiki] Update of EmbeddedPig by OlgaN

2007-11-07 Thread Apache Wiki
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/EmbeddedPig

--
  The following steps need to be carried out:
  
   * Make sure `pig.jar` is on your classpath.
-  * Create an instance of `PigServer`
+  * Create an instance of `PigServer`. See 
[http://incubator.apache.org/pig/javadoc/doc/ Javadoc] for more details.
   * Issue commands through that PigServer by calling 
`PigServer.registerQuery()`.  
   * To retrieve results, either call `PigServer.openIterator()` or 
`PigServer.store()`.
   * If you have user defined functions, register them by calling 
`PigServer.registerJar()`.