[Pig Wiki] Trivial Update of "WriteFunctions" by CorinneC

2008-10-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/WriteFunctions

--
  [[Anchor(Writing_your_own_Pig_functions)]]
- == Writing your own user defined Pig functions ==
+ == Pig User Defined Functions ==
  
- Pig has a number of built-in functions for loading, filtering, aggregating, 
etc. (A complete list is available at PigBuiltins.) However, if you want to do 
something specialized you may need to write your own user defined function 
(UDF). This page will walk you through how to do this.
+ Pig has a number of built-in functions for loading, filtering, aggregating 
data (for a complete list, see PigBuiltins.) However, if you want to do 
something specialized, you may need to write your own Pig user defined function 
(UDF). This page walks you through the process.
  
  [[Anchor(Types_of_functions)]]
  === Types of functions ===
+ 
+ '''Eval Function'''
+ 
- The most important type and commonly used type of functions are EvalFunction. 
Eval functions consume a tuple, do some computation, and produce some data 
+ The most important and commonly used type of functions are EvalFunction. Eval 
functions consume a tuple, do some computation, and produce some data. 
  
  Eval functions are very flexible, e.g. they can mimic "map" and "reduce" 
style functions:
* ''"Map" behavior:'' The output type of an Eval Function is one of: a 
single value, a tuple, or a bag of tuples (a Map/Reduce "map" function produces 
a bag of tuples).
* ''"Reduce" behavior:'' Recall that in the Pig data model, a tuple may 
contain fields of type ''bag''. Hence an Eval Function may perform aggregation 
or "reducing" by iterating over a bag of tuples nested within the input tuple. 
This is how the built-in aggregation function SUM(...) works, for example.   
-
- The other types of functions are:
-* '''Load Function:''' controls reading of tuples from files
-* '''Store Function:''' controls storing of tuples to files
+ 
+ '''Load Function'''
+  
+ Controls reading of tuples from files.
+ 
+ '''Store Function'''
+  
+ Controls storing of tuples to files.
  
  [[Anchor(Example)]]
  ==== Example ====


[Pig Wiki] Trivial Update of "WriteFunctions" by CorinneC

2008-09-25 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/WriteFunctions

--
  [[Anchor(Writing_your_own_Pig_functions)]]
  == Writing your own user defined Pig functions ==
  
- Pig has a number of built-in functions for loading, filtering, aggregating, 
etc. (A complete list is available at PigBuiltins.) However, if you want to do 
something specialized you may need to write your own function. This page will 
walk you through how to do this.
+ Pig has a number of built-in functions for loading, filtering, aggregating, 
etc. (A complete list is available at PigBuiltins.) However, if you want to do 
something specialized you may need to write your own user defined function 
(UDF). This page will walk you through how to do this.
  
  [[Anchor(Types_of_functions)]]
  === Types of functions ===


[Pig Wiki] Trivial Update of "WriteFunctions" by CorinneC

2008-09-25 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/WriteFunctions

--
  [[Anchor(Writing_your_own_Pig_functions)]]
- == Writing your own Pig functions ==
+ == Writing your own user defined Pig functions ==
  
  Pig has a number of built-in functions for loading, filtering, aggregating, 
etc. (A complete list is available at PigBuiltins.) However, if you want to do 
something specialized you may need to write your own function. This page will 
walk you through how to do this.
  


[Pig Wiki] Trivial Update of "WriteFunctions" by CorinneC

2008-07-31 Thread Apache Wiki

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/WriteFunctions

New page:
[[Anchor(Writing_your_own_Pig_functions)]]
== Writing your own Pig functions ==

Pig has a number of built-in functions for loading, filtering, aggregating, 
etc. (A complete list is available at PigBuiltins.) However, if you want to do 
something specialized you may need to write your own function. This page will 
walk you through how to do this.

[[Anchor(Types_of_functions)]]
=== Types of functions ===
The most important and most commonly used type of function is the EvalFunction. 
Eval functions consume a tuple, do some computation, and produce some data.

Eval functions are very flexible, e.g. they can mimic "map" and "reduce" style 
functions:
  * ''"Map" behavior:'' The output type of an Eval Function is one of: a 
single value, a tuple, or a bag of tuples (a Map/Reduce "map" function produces 
a bag of tuples).
  * ''"Reduce" behavior:'' Recall that in the Pig data model, a tuple may 
contain fields of type ''bag''. Hence an Eval Function may perform aggregation 
or "reducing" by iterating over a bag of tuples nested within the input tuple. 
This is how the built-in aggregation function SUM(...) works, for example.   
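
The two behaviors above can be illustrated without any Pig classes. The following is a minimal plain-Java sketch, not Pig's actual API: a tuple is modeled as a `List<Object>` and a bag as a list of such tuples, and the names `EvalSketch`, `tokenize`, and `sum` are hypothetical. `tokenize` shows "map" behavior (one tuple in, a bag of tuples out); `sum` shows "reduce" behavior (it iterates over a bag nested inside the input tuple).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only: these are NOT Pig classes.
// A tuple is a List<Object>; a bag is a List of tuples.
final class EvalSketch {

    // "Map" behavior: consume one tuple, produce a bag of tuples.
    // Splits the string in field 0 on whitespace, one tuple per word.
    static List<List<Object>> tokenize(List<Object> input) {
        List<List<Object>> bag = new ArrayList<>();
        for (String word : ((String) input.get(0)).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                bag.add(List.of((Object) word));
            }
        }
        return bag;
    }

    // "Reduce" behavior: field 0 of the input tuple is itself a bag;
    // aggregate it by summing field 0 of every nested tuple.
    @SuppressWarnings("unchecked")
    static double sum(List<Object> input) {
        List<List<Object>> bag = (List<List<Object>>) input.get(0);
        double total = 0.0;
        for (List<Object> tuple : bag) {
            total += ((Number) tuple.get(0)).doubleValue();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(List.of((Object) "joe smith")));
        List<List<Object>> prices =
            List.of(List.of((Object) 1.5), List.of((Object) 2.5));
        System.out.println(sum(List.of((Object) prices)));
    }
}
```

In a real UDF the same logic would be expressed against Pig's own tuple and bag types; see PigDataTypeApis for those.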
   
The other types of functions are:
   * '''Load Function:''' controls reading of tuples from files
   * '''Store Function:''' controls storing of tuples to files

[[Anchor(Example)]]
==== Example ====

The following example uses each of the types of functions. It computes the set 
of unique IP addresses associated with "good" products drawn from a list of 
products found on the web.

{{{
register myFunctions.jar
products = LOAD '/productlist.txt' USING MyListStorage() AS (name, price, description, url);
goodProducts = FILTER products BY (price <= '19.99');
hostnames = FOREACH goodProducts GENERATE MyHostExtractor(url) AS hostname;
uniqueIPs = FOREACH (GROUP hostnames BY MyIPLookup(hostname)) GENERATE group AS ipAddress;
STORE uniqueIPs INTO '/iplist.txt' USING MyListStorage();
}}}

In the above example, !MyListStorage() serves as a load function as well as a 
store function; !MyHostExtractor() and MyIPLookup() are eval functions. 
`myFunctions.jar` is a jar file that contains the classes for the user-defined 
functions.
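
The page does not show how the eval functions in this example are implemented. As an illustration only, the core of something like `MyHostExtractor` might look like the plain-Java sketch below. The class name `HostExtractorSketch`, the method `extractHost`, and the choice of `java.net.URI` are all assumptions, and the Pig wrapper class is omitted.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: the page does not show how MyHostExtractor is written.
final class HostExtractorSketch {

    // Returns the host part of a URL, or null if the string cannot be
    // parsed as a URI or has no host component.
    static String extractHost(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://www.example.com/products/42"));
    }
}
```

An eval function built around this would read the `url` field from its input tuple, call the helper, and emit the result as a single value.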


[[Anchor(How_to_write_functions)]]
=== How to write functions ===

Ready to write your own handy-dandy Pig function? Before you start, you will 
need to know about the APIs for interacting with the data types (atom, tuple, 
bag); see PigDataTypeApis.

Click below to learn how to build your own:
   * EvalFunction
   * [http://wiki.apache.org/pig/StorageFunction Load/Store Function] (These 
are the most difficult to write; the built-in ones are usually sufficient.)

[[Anchor(Ok,_I_have_written_my_function,_how_to_use_it?)]]
=== Ok, I have written my function, how to use it? ===

You can use your functions following the steps below:

   * Put all the compiled files used by your function together into a jar file
   * Tell Pig about that jar with the `register` command before using 
the function. To register a UDF jar, you can either specify a full path to the 
jar file (`register /home/myjars/udfs.jar`) or place the jar file in 
your classpath, where Pig will find it (`register udfs.jar`). If you are 
using PigLatin in embedded mode, call `PigServer.registerJar()` instead.
   * Then use your function as you would use a built-in. It's that simple.

Example:

The following example describes how to use your Eval function. Follow the same 
procedure for your Load/Store function.

1. Create your function `/src/myfunc/MyEvalFunc.java`

{{{
package myfunc;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Splits the first field of each input tuple into tokens and emits
// one single-field tuple per token.
public class MyEvalFunc extends EvalFunc<DataBag>
{
    //@Override
    public void exec(Tuple input, DataBag output) throws IOException
    {
        // Read field 0 of the input tuple as a string.
        String str = input.getAtomField(0).strval();
        // Delimiters: space, double quote, comma, parentheses, asterisk.
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens())
        {
            output.add(new Tuple(tok.nextToken()));
        }
    }
}
}}}
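
To see what the tokenizing step in `MyEvalFunc` does on its own, the following plain-Java sketch runs the same `StringTokenizer` call outside of Pig. The class name `TokenizeSketch` and the sample input are made up for illustration; only the delimiter string is taken from the function above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Stand-alone illustration of the tokenizing logic; no Pig classes involved.
final class TokenizeSketch {

    // Same delimiter set as MyEvalFunc: space, double quote, comma,
    // parentheses, and asterisk. Delimiters are not returned as tokens.
    static List<String> tokens(String str) {
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        List<String> out = new ArrayList<>();
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input containing several of the delimiters.
        System.out.println(tokens("\"red apple\",(green*pear)"));
    }
}
```

Each token returned here corresponds to one tuple that `MyEvalFunc` adds to its output bag.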

2. Compile your function. Make sure to point the Java compiler to the Pig jar file.

{{{
/src/myfunc $ javac -classpath /src/pig.jar MyEvalFunc.java
}}}

3. Create the jar file.

{{{
/src/myfunc $ cd ..
/src $  jar cf myfunc.jar myfunc
}}}

4. Use the function through grunt (use from a script is similar). Note that there 
are no quotes around the path in the `register` call.

{{{
/src $ java -jar pig.jar -
grunt> register /src/myfunc.jar
grunt> A = load 'students' using PigStorage('\t');
grunt> B = foreach A generate myfunc.MyEvalFunc($0);
grunt> dump B;
({(joe smith)})
({(john adams)})
({(anne white)})