[Pig Wiki] Update of "FrontPage" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/FrontPage -- [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! - Pig Latin Editors, Pig Python wrappers, Pig available on Amazon, and other tools, see PigTools - == Developer Documentation == * How tos * HowToDocumentation
[Pig Wiki] Update of "FrontPage" by GregStein
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by GregStein: http://wiki.apache.org/pig/FrontPage The comment on the change is: Restore useful information. And it shouldn't be just a vendor link. -- http://hadoop.apache.org/pig/ + (./) Check it out ... updates and new additions. + + * New to Pig? Getting Started ... + 1. PigOverview - An overview of Pig's capabilities + 1. [http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html Pig Quick Start] - How to build and run Pig + 1. [http://hadoop.apache.org/pig/docs/r0.3.0/tutorial.html Pig Tutorial]- Tackle a real task with pig, start to finish - [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! + 1. [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! + + * Pig Language + + * [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Reference Manual] - Includes Pig Latin, built-in functions, and shell commands + + * Pig Functions + * PiggyBank - User-defined functions (UDFs) contributed by Pig users! + * [http://hadoop.apache.org/pig/docs/r0.3.0/udf.html UDF Manual] - Write your own UDFs + + * (./) Pig Latin Editors, Pig Python wrappers, and other tools, see PigTools + + * More Pig + * [http://hadoop.apache.org/pig/docs/r0.3.0/cookbook.html Apache Pig Cookbook] - Want Pig to fly? Tips and tricks on how to write efficient Pig scripts + * [http://hadoop.apache.org/pig/javadoc/docs/api/ Javadocs] - Refer to the Javadocs for embedded Pig and UDFs + * [http://wiki.apache.org/pig/FAQ FAQ] - The answer to your question may be here + == Developer Documentation == * How tos
[Pig Wiki] Update of "FrontPage" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/FrontPage The comment on the change is: Removing duplicate links to the documentation per discussion on the user list. -- == User Documentation == + * [http://hadoop.apache.org/pig/ User Documentation] - http://hadoop.apache.org/pig/ - - (./) Check it out ... updates and new additions. - - * New to Pig? Getting Started ... - 1. PigOverview - An overview of Pig's capabilities - 1. [http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html Pig Quick Start] - How to build and run Pig - 1. [http://hadoop.apache.org/pig/docs/r0.3.0/tutorial.html Pig Tutorial]- Tackle a real task with pig, start to finish - 1. [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! + * [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! - - * Pig Language - - * [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Reference Manual] - Includes Pig Latin, built-in functions, and shell commands - - * Pig Functions - * PiggyBank - User-defined functions (UDFs) contributed by Pig users! + * PiggyBank - User-defined functions (UDFs) contributed by Pig users! - * [http://hadoop.apache.org/pig/docs/r0.3.0/udf.html UDF Manual] - Write your own UDFs - - * (./) Pig Latin Editors, Pig Python wrappers, and other tools, see PigTools - - * More Pig - * [http://hadoop.apache.org/pig/docs/r0.3.0/cookbook.html Apache Pig Cookbook] - Want Pig to fly? 
Tips and tricks on how to write efficient Pig scripts - * [http://hadoop.apache.org/pig/javadoc/docs/api/ Javadocs] - Refer to the Javadocs for embedded Pig and UDFs - * [http://wiki.apache.org/pig/FAQ FAQ] - The answer to your question may be here - == Developer Documentation == * How tos
[Pig Wiki] Update of "FrontPage" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/FrontPage -- '''Interested in Pig Guts?''' We are completely redesigning the Pig execution and optimization framework. For design details see PigOptimizationWishList and PigExecutionModel. '''Want to contribute but don't know where to kick in?''' Here is a [http://wiki.apache.org/pig/ProposedProjects list of projects] we would like to see done. We need new blood! + + '''Pig available as part of Amazon's Elastic !MapReduce''', as of August 2009. == General Information ==
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification -- ||2174||Internal exception. Could not create the sampler job. || ||2175||Internal error. Could not retrieve file size for the sampler. || ||2176||Error processing right input during merge join|| + ||2177||Prune column optimization: Cannot retrieve operator from null or empty list|| + ||2178||Prune column optimization: The matching node from the optimizor framework is null|| + ||2179||Prune column optimization: Error while performing checks to prune columns.|| + ||2180||Prune column optimization: Only LOForEach and LOSplit are expected|| + ||2181||Prune column optimization: Unable to prune columns.|| + ||2182||Prune column optimization: Only relational operator can be used in column prune optimization.|| + ||2183||Prune column optimization: LOLoad must be the root logical operator.|| + ||2184||Prune column optimization: Fields list inside RequiredFields is null.|| + ||2185||Prune column optimization: Unable to prune columns when processing node|| + ||2186||Prune column optimization: Cannot locate node from successor|| + ||2187||Column pruner: Cannot get predessors|| + ||2188||Column pruner: Cannot prune columns|| + ||2189||Column pruner: Expect schema|| + ||2190||PruneColumns: Cannot find predecessors for logical operator|| + ||2191||PruneColumns: No input to prune|| + ||2192||PruneColumns: Column to prune does not exist|| + ||2193||PruneColumns: Foreach can only have 1 predecessor|| + ||2194||PruneColumns: Expect schema|| + ||2195||PruneColumns: Fail to visit foreach inner plan|| + ||2196||RelationalOperator: Exception when traversing inner plan|| + ||2197||RelationalOperator: Cannot drop column which require *|| + ||2198||LOLoad: load only take 1 input|| + ||2199||LOLoad: schema mismatch|| ||2998||Unexpected internal error.|| ||2999||Unhandled 
internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal New page: = Proposed Redesign For Load, Store, and Slicer in Pig = == Goals == The current design of !LoadFunc, !StoreFunc, and the Slicer interfaces in Pig are not adequate. This proposed redesign has the following goals: 1. The Slicer interface is redundant. Remove it and allow users to directly use Hadoop !InputFormats in Pig. 1. It is not currently easy to use a separate !OutputFormat for a !StoreFunc. This should be made easy to allow users to store data into locations other than HDFS. 1. Currently users that wish to operate on Pig and Map-Reduce are required to write Hadoop !InputFormat and !OutpuFormat as well as a Pig loader and Pig storage functions. While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost. 1. The major difference between a Hadoop !InputFormat and a Pig load function is the data model. Hadoop views data as key-value pairs, Pig as a tuple. Similarly for !OutputFormat and store functions. 1. New storage formats such as Zebra are being implemented for Hadoop that include metadata information such as schema, etc. The !LoadFunc interface needs to allow Pig to obtain this metadata. There is a describeSchema call in the current interface. More functions may be necessary. 1. These new storage formats also plan to support pushing of, at least, projection and selection into the storage layer. Pig needs to be able to query loaders to determine what if any pushdown capabilities they support and then make use of those capabilities. 1. There already exists one metadata system in Hadoop (Hive's metastore) and there is a proposal to add another (Owl). 
Pig needs to be able to query these metadata systems for information about data to be read. It also needs to be able to record information to these metadata systems when writing data. The load and store functions are a reasonable place to do these operations since that is the point at which Pig is reading and writing data. This will also allow Pig to read and write data from and to multiple metadata stores in single Pig Latin scripts if that is desired. A requirement for the implementation that does not fit into the goals above is that while the existing Pig implementation is tightly tied to Hadoop (and is becoming more tightly tied all the time), we do not want to tie Pig Latin tightly to Hadoop. Therefore while we plan to allow users to easily interact with Hadoop !InputFormats and !OutputFormats, these should not be exposed as such to Pig Latin. Pig Latin must still view these as load and store functions; it will only be the underlying implementation that will realize that they are Hadoop classes and handle them appropriately. == Interfaces == With these proposed changes, load and store functions in Pig are becoming very weighty objects. The current !LoadFunc interface already provides mechanisms for reading the data, getting some schema information, casting data, and some place holders for pushing down projections into the loader. This proposal will add more file level metadata, global metadata, selection push down, plus interaction with !InputFormats. It will also add !OutputFormats to store functions. If we create two monster interfaces that attempt to provide everything, the burden of creating a new load or store function in Pig will become overwhelming. Instead, this proposal envisions splitting the interface into a number of interfaces, each with a clear responsibility. Load and store functions will then only be required to implement the interfaces for functionality they offer. 
For load functions: * !LoadFunc will be pared down to just contain functions directly associated with reading data, such as getNext. * A new !LoadCaster interface will be added. This interface will contain all of the bytesToX methods currently in !LoadFunc. !LoadFunc will add a getCaster routine, that will return an object that can provide casts. The existing UTF8!StorageConverter class will change to implement this interface. Load functions will then be free to use this class as their caster, or provide their own. For existing load functions that provide all of the bytesToX methods, they can implement the !LoadCaster interface and return themselves from the getCaster routine. If a loader does not provide a !LoadCaster, casts from byte array to other pig types will not be supported for data loaded via that loader. * A new !LoadMetadata interface will be added. Calls that find metadata about the data being loaded, such as determineSchema, will be placed in this interface. If a loader does not im
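The getCaster split described above can be sketched in a few lines. This is an illustrative stand-in, not the real Pig API: the interface and class names are simplified, and only two of the many bytesToX methods are shown. A loader that already provides the conversions implements the caster interface and returns itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the proposed LoadCaster split; the real
// interface would carry the full set of bytesToX methods from LoadFunc.
interface LoadCaster {
    Integer bytesToInteger(byte[] b) throws IOException;
    String bytesToCharArray(byte[] b) throws IOException;
}

// A load function that already provides the bytesToX methods can
// implement LoadCaster itself and return `this` from getCaster.
class SimpleTextLoader implements LoadCaster {

    LoadCaster getCaster() {
        return this; // loader doubles as its own caster
    }

    public Integer bytesToInteger(byte[] b) throws IOException {
        try {
            return Integer.valueOf(new String(b, StandardCharsets.UTF_8).trim());
        } catch (NumberFormatException e) {
            throw new IOException("cannot cast bytes to integer", e);
        }
    }

    public String bytesToCharArray(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

A loader that does not implement the caster interface would simply return null (or a shared UTF8 caster) from getCaster, and byte-array casts for its data would be unsupported, as the proposal states.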
[Pig Wiki] Update of "MetadataInterfaceProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/MetadataInterfaceProposal New page: = Proposed Design for Pig Metadata Interface = With the introduction of SQL, Pig needs to be able to communicate with external metadata services. These communications will includes such operations as creating, altering, and dropping databases, tables, etc. It will also include metadata queries, such as requests to show available tables, etc. DDL operations of these sorts will be beyond the scope of the proposed metadata interfaces for load and storage functions. However, Pig should not be tightly tied to a single metadata implementation. It should be able to work with Owl, Hive's metastore, or any other metadata source that is added to Hadoop. To this end this document proposes an interface for operating with metadata systems. Different metadata connectors can then be implemented, one for each metadata system. == Interface == This interface will allow users to find information about tables, databases, etc. in the metadata store. For each call, it will pass the portion of the syntax tree relavant to the operation to the metadata connector. These structures will be versioned. {{{ /** * An interface to encapsulate DDL operations. 
*/ interface MetadataDDL { void createTable(CreateTable ct) throws IOException; void alterTable(AlterTable at) throws IOException; // includes add and drop partition void dropTable(DropTable dt) throws IOException; SQLTable[] showTables(Database db) throws IOException; // info returned in SQLTable includes info on partitions void createDatabase(CreateDatabase cd) throws IOException; void alterDatabase(AlterDatabase ad) throws IOException; void dropDatabase(DropDatabase dd) throws IOException; SQLDatabase[] showDatabases() throws IOException; } }}} == Accessing Global Metadata From SQL == Pig will be configured to work with one global metadata source for a given set of SQL operations. This configuration will be via Pig's configuration file. It will specify the URI of the server to use and the implementation of !MetadataDDL to use with this server. == Accessing Global Metadata from Pig Latin == Pig Latin will not support a call to metadata within the language itself. Instead, it will support the ability to invoke a SQL DDL command. This SQL will then be sent to the SQL parser and dispatched through the metadata service as before. {{{ A = load ... ... SQL {"create table myTable ..."}; store Z into 'myTable' using OwlStorage(); }}}
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal -- 1. The Slicer interface is redundant. Remove it and allow users to directly use Hadoop !InputFormats in Pig. 1. It is not currently easy to use a separate !OutputFormat for a !StoreFunc. This should be made easy to allow users to store data into locations other than HDFS. - 1. Currently users that wish to operate on Pig and Map-Reduce are required to write Hadoop !InputFormat and !OutpuFormat as well as a Pig loader and Pig storage functions. While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost. + 1. Currently users that wish to operate on Pig and Map-Reduce are required to write Hadoop !InputFormat and !OutputFormat as well as a Pig load and storage functions. While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost. - 1. The major difference between a Hadoop !InputFormat and a Pig load function is the data model. Hadoop views data as key-value pairs, Pig as a tuple. Similarly for !OutputFormat and store functions. 1. New storage formats such as Zebra are being implemented for Hadoop that include metadata information such as schema, etc. The !LoadFunc interface needs to allow Pig to obtain this metadata. There is a describeSchema call in the current interface. More functions may be necessary. 1. These new storage formats also plan to support pushing of, at least, projection and selection into the storage layer. 
Pig needs to be able to query loaders to determine what if any pushdown capabilities they support and then make use of those capabilities. 1. There already exists one metadata system in Hadoop (Hive's metastore) and there is a proposal to add another (Owl). Pig needs to be able to query these metadata systems for information about data to be read. It also needs to be able to record information to these metadata systems when writing data. The load and store functions are a reasonable place to do these operations since that is the point at which Pig is reading and writing data. This will also allow Pig to read and write data from and to multiple metadata stores in single Pig Latin scripts if that is desired. @@ -22, +21 @@ == Interfaces == With these proposed changes, load and store functions in Pig are becoming very weighty objects. The current !LoadFunc interface already provides mechanisms for reading the data, getting some schema information, casting data, and some place holders for pushing down projections into - the loader. This proposal will add more file level metadata, global metadata, selection push down, plus interaction with !InputFormats. It will + the loader. This proposal will add more file level metadata, selection push down, plus interaction with !InputFormats. It will also add !OutputFormats to store functions. If we create two monster interfaces that attempt to provide everything, the burden of creating a new load or store function in Pig will become overwhelming. Instead, this proposal envisions splitting the interface into a number of interfaces, each with a clear responsibility. Load and store functions will then only be required to implement the interfaces for functionality they offer. For load functions: - * !LoadFunc will be pared down to just contain functions directly associated with reading data, such as getNext. + * '''!LoadFunc''' will be pared down to just contain functions directly associated with reading data, such as getNext. 
- * A new !LoadCaster interface will be added. This interface will contain all of the bytesToX methods currently in !LoadFunc. !LoadFunc will add a getCaster routine, that will return an object that can provide casts. The existing UTF8!StorageConverter class will change to implement this interface. Load functions will then be free to use this class as their caster, or provide their own. For existing load functions that provide all of the bytesToX methods, they can implement the !LoadCaster interface and return themselves from the getCaster routine. If a loader does not provide a !LoadCaster, casts from byte array to other pig types will not be supported for data loaded via that loader. + * A new '''!LoadCaster''' interface will be added. This interface will contain all of the bytesToX methods currently in !LoadFunc. !LoadFunc will add a `getCaster` routine, that will return an object that
[Pig Wiki] Update of "MetadataInterfaceProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/MetadataInterfaceProposal -- = Proposed Design for Pig Metadata Interface = - With the introduction of SQL, Pig needs to be able to communicate with external metadata services. These communications will includes such + With the introduction of SQL, Pig needs to be able to communicate with external metadata services. These communications will include such operations as creating, altering, and dropping databases, tables, etc. It will also include metadata queries, such as requests to show available tables, etc. DDL operations of these sorts will be beyond the scope of the proposed metadata interfaces for load and storage functions. However, Pig should not be tightly tied to a single metadata implementation. It should be able to - work with Owl, Hive's metastore, or any other metadata source that is added to Hadoop. To this end this document proposes an interface for + work with Owl, Hive's metastore, or any other metadata source that is added to Hadoop. To this end, this document proposes an interface for operating with metadata systems. Different metadata connectors can then be implemented, one for each metadata system. == Interface == - This interface will allow users to find information about tables, databases, etc. in the metadata store. For each call, it will pass the portion of the syntax tree relavant to the operation to the metadata connector. These structures will be versioned. @@ -39, +38 @@ configuration file. It will specify the URI of the server to use and the implementation of !MetadataDDL to use with this server. == Accessing Global Metadata from Pig Latin == - Pig Latin will not support a call to metadata within the language itself. Instead, it will support the ability to invoke a SQL DDL command. 
+ Pig Latin will not support a call to metadata within the language itself. Instead, it will support the ability to invoke a Pig SQL DDL command. This SQL will then be sent to the SQL parser and dispatched through the metadata service as before. {{{
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification -- ||2197||RelationalOperator: Cannot drop column which require *|| ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| + ||2200||PruneColumns: Error getting top level project|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=3&rev2=4 * @param reader RecordReader to be used by this instance of the LoadFunc */ void prepareToRead(RecordReader reader); + + /** + * Called after all reading is finished. + */ + void doneReading(); /** * Retrieves the next tuple to be processed. @@ -289, +294 @@ void prepareToWrite(RecordWriter writer); /** + * Called when all writing is finished. + */ + void doneWriting(); + + /** * Write a tuple the output stream to which this instance was * previously bound. *
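The read-side lifecycle with the newly added doneReading call can be sketched as follows. All names are illustrative stand-ins: an Iterator takes the place of Hadoop's RecordReader so the sketch stays self-contained.

```java
import java.util.Iterator;

// Sketch of the read lifecycle with the new doneReading call:
// prepareToRead, then getNext until it returns null, then doneReading.
// An Iterator stands in for Hadoop's RecordReader.
class SketchLoadFunc {
    private Iterator<String> reader;
    boolean readingFinished = false;

    void prepareToRead(Iterator<String> reader) {
        this.reader = reader;
    }

    String getNext() {
        return reader.hasNext() ? reader.next() : null;
    }

    // Called after all reading is finished, e.g. to release resources.
    void doneReading() {
        readingFinished = true;
    }
}
```

The write side mirrors this: prepareToWrite binds a RecordWriter, and the new doneWriting call closes out the task once all tuples are written.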
[Pig Wiki] Update of "ProposedRoadMap" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "ProposedRoadMap" page has been changed by AlanGates: http://wiki.apache.org/pig/ProposedRoadMap?action=diff&rev1=3&rev2=4 - <> = Pig Road Map = The following document was developed as a roadmap for pig at Yahoo prior to pig being released as open source.
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=4&rev2=5 interface LoadFunc { /** - * Communicate to the loader the URIs used in Pig Latin to refer to the + * Communicate to the loader the load string used in Pig Latin to refer to the * object(s) being loaded. That is, if the PL script is * A = load 'bla' - * then 'bla' is the URI. Load functions should assume that if no - * scheme is provided in the URI it is an hdfs file. This will be + * then 'bla' is the load string. In general Pig expects these to be + * a path name, a glob, or a URI. If there is no URI scheme present, + * Pig will assume it is a file name. This will be * called during planning on the front end, not during execution on * the backend. - * @param uri URIs referenced in load statement. + * @param location Location indicated in load statement. + * @throws IOException if the location is not valid. */ - void setURI(URI[] uri); + void setLocation(String location) throws IOException; /** * Return the InputFormat associated with this loader. This will be * called during planning on the front end. The LoadFunc need not * carry the InputFormat information to the backend, as it will - * be provided with the appropriate RecordReader there. + * be provided with the appropriate RecordReader there. This is the + * instance of InputFormat (rather than the class name) because the + * load function may need to instantiate the InputFormat in order + * to control how it is constructed. */ InputFormat getInputFormat(); @@ -77, +82 @@ /** * Initializes LoadFunc for reading data. This will be called during execution - * before any calls to getNext. + * before any calls to getNext. The RecordReader needs to be passed here because + * it has been instantiated for a particular InputSplit. 
* @param reader RecordReader to be used by this instance of the LoadFunc */ void prepareToRead(RecordReader reader); @@ -100, +106 @@ }}} Open questions for !LoadFunc: - 1. Should setURI instead be setLocation and just take a String? The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given). The disadvantage is forcing more structure on users and their load functions. I'm still pretty strongly on the side of using URI. + 1. Should setLocation instead be setURI and take a URI? The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given). The disadvantage is forcing more structure on users and their load functions. The '''!LoadCaster''' interface will include bytesToInt, bytesToLong, etc. functions currently in !LoadFunc. UTF8!StorageConverter will implement this interface. @@ -121, +127 @@ * not possible to return a schema that represents all returned data, * then null should be returned. */ - LoadSchema getSchema(); + ResourceSchema getSchema(); /** * Get statistics about the data to be loaded. If no statistics are * available, then null should be returned. */ - LoadStatistics getStatistics(); + ResourceStatistics getStatistics(); + + /** + * Find what columns are partition keys for this input. + * This function assumes that setLocation has already been called. + * @return array of field names of the partition keys. + */ + String[] getPartitionKeys(); + + /** + * Set the filter for partitioning. It is assumed that this filter + * will only contain references to fields given as partition keys in + * getPartitionKeys + * @param plan that describes filter for partitioning + * @throws IOException if the filter is not compatible with the storage + * mechanism or contains non-partition fields. 
+ */ + void setPartitionFilter(OperatorPlan plan) throws IOException; } }}} - '''!LoadSchema''' will be a top level object (`org.apache.pig.LoadSchema`) used to communicate information about data to be loaded or that is being + '''!ResourceSchema''' will be a top level object (`org.apache.pig.ResourceSchema`) used to communicate information about data to be loaded or that is being stored. It is not the same as the existing `org.apache.pig.impl.logicalLayer.schema.Schema`. {{{ - public class LoadSchema { + public class ResourceSchema { int version; - public class LoadFieldSchema { + public class ResourceFieldSchema
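The new partition calls on !LoadMetadata can be illustrated with a small stand-in. The key names are hypothetical, and the real setPartitionFilter receives an OperatorPlan rather than a list of referenced field names; only the accept/reject contract is sketched here.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Stand-in for the proposed LoadMetadata partition calls. The keys are
// hypothetical; the real setPartitionFilter receives an OperatorPlan
// rather than a list of referenced field names.
class SketchLoadMetadata {
    // Pretend setLocation was already called and told us the keys.
    private final List<String> partitionKeys = Arrays.asList("date", "region");

    String[] getPartitionKeys() {
        return partitionKeys.toArray(new String[0]);
    }

    // The proposal requires the filter to reference only partition keys;
    // anything else is rejected with an IOException.
    void setPartitionFilter(List<String> referencedFields) throws IOException {
        for (String f : referencedFields) {
            if (!partitionKeys.contains(f)) {
                throw new IOException("non-partition field in filter: " + f);
            }
        }
    }
}
```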
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=122&rev2=123 ||2182||Prune column optimization: Only relational operator can be used in column prune optimization.|| ||2183||Prune column optimization: LOLoad must be the root logical operator.|| ||2184||Prune column optimization: Fields list inside RequiredFields is null.|| - ||2185||Prune column optimization: Unable to prune columns when processing node|| ||2186||Prune column optimization: Cannot locate node from successor|| ||2187||Column pruner: Cannot get predessors|| ||2188||Column pruner: Cannot prune columns||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=5&rev2=6 void prepareToWrite(RecordWriter writer); /** - * Called when all writing is finished. + * Called when all writing is finished. This will be called on the backend, + * once for each writing task. */ void doneWriting(); @@ -330, +331 @@ * @throws IOException */ void putNext(Tuple t) throws IOException; + + /** + * Called when writing all of the data is finished. This can be used + * to commit information to a metadata system, clean up tmp files, + * close connections, etc. This call will be made on the front end + * after all back end processing is finished. + */ + void allFinished(); + + } @@ -461, +472 @@ == Changes == Sept 23 2009, Gates * Changed setURI to setLocation in !LoadFunc and !StoreFunc. Also changed it to throw IOException in the cases where the passed in location is not valid for this load or store mechanism. - * Changed LoadSchema to ResourceSchema and LoadStatistics to ResourceStatistics + * Changed !LoadSchema to !ResourceSchema and !LoadStatistics to !ResourceStatistics - * Added getPartitionKeys and setPartitionFilter to LoadMetadata + * Added getPartitionKeys and setPartitionFilter to !LoadMetadata + Sept 25 2009, Gates + * Added allFinished call to !StoreFunc +
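The distinction between the per-task doneWriting call and the front-end allFinished call can be sketched as below. All names are illustrative, strings stand in for tuples, and the driver loop merely simulates the backend tasks that Hadoop would actually run.

```java
import java.util.List;

// Sketch of the write lifecycle: doneWriting runs once per backend
// writing task, while allFinished runs once on the front end after all
// tasks complete (e.g. to commit metadata or clean up tmp files).
class SketchStoreFunc {
    int tuplesWritten = 0;
    int doneWritingCalls = 0;
    boolean allFinishedCalled = false;

    void putNext(String tuple) { tuplesWritten++; }

    void doneWriting() { doneWritingCalls++; }       // backend, per task

    void allFinished() { allFinishedCalled = true; } // front end, once
}

class StoreLifecycleDemo {
    // Simulate several writing tasks followed by the front-end commit.
    static SketchStoreFunc run(List<List<String>> tasks) {
        SketchStoreFunc store = new SketchStoreFunc();
        for (List<String> task : tasks) {
            for (String t : task) {
                store.putNext(t);
            }
            store.doneWriting();
        }
        store.allFinished();
        return store;
    }
}
```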
[Pig Wiki] Update of "MetadataInterfaceProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "MetadataInterfaceProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/MetadataInterfaceProposal?action=diff&rev1=3&rev2=4 void alterDatabase(AlterDatabase ad) throws IOException; void dropDatabase(DropDatabase dd) throws IOException; SQLDatabase[] showDatabases() throws IOException; + + /** + * Get the default load function for this metadata service. This + * will be called by SQL to determine the right load function for + * the metadata service it is connected to. + * @return class name of the default load function for this interface. + */ + String getLoaderClass(); + + /** + * Get the default storage function for this metadata service. This + * will be called by SQL to determine the right storage function for + * the metadata service it is connected to. + * @return class name of the default storage function for this interface. + */ + String getStorageClass(); + + } @@ -48, +66 @@ store Z into 'myTable' using OwlStorage(); }}} + == Changes == + September 25 2009 + * Added getLoaderClass and getStorageClass to interface, Gates. +
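One way the SQL layer might consume getLoaderClass, sketched with entirely hypothetical connector and loader classes. The reflection step is an assumption on our part; the proposal only specifies that class names are returned.

```java
// Hypothetical sketch of how SQL could use getLoaderClass: the metadata
// connector hands back a class name, and Pig instantiates the load
// function by reflection, staying decoupled from any one metadata
// service. All class names here are made up for illustration.
interface MetadataDDLSketch {
    String getLoaderClass();
    String getStorageClass();
}

class OwlConnectorSketch implements MetadataDDLSketch {
    public String getLoaderClass()  { return "OwlLoaderSketch"; }
    public String getStorageClass() { return "OwlStorageSketch"; }
}

class OwlLoaderSketch { }   // stand-in for the real load function
class OwlStorageSketch { }  // stand-in for the real store function

class LoaderFactory {
    // Resolve and instantiate the connector's default load function.
    static Object loaderFor(MetadataDDLSketch md) throws Exception {
        return Class.forName(md.getLoaderClass())
                    .getDeclaredConstructor()
                    .newInstance();
    }
}
```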
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=6&rev2=7 type conversion on this data will be done in the same way as noted above for !InputFormatLoader. Open Questions: - 1. Does all this force us to switch to Hadoop for local mode as well? We aren't opposed to using Hadoop for local mode it just needs to get reasonable fast. Can we use !InputFormat ''et. al.'' on local files without using the whole HDFS structure? + 1. Does all this force us to switch to Hadoop for local mode as well? We aren't opposed to using Hadoop for local mode; it just needs to get reasonably fast. Can we use !InputFormat ''et al.'' on local files without using the whole HDFS structure? '''Answer''' According to the Hadoop documentation, !TextInputFormat works on local files as well as hdfs files. We may need to detect that we are in local mode and change the filename to `file://` + 1. How will we work with compressed files? !FileInputFormat already works with bzip and gzip compressed files, producing reasonable splits. !PigStorage will be reworked to depend on !FileInputFormat (or a descendant thereof, see next item) and should therefore be able to use this functionality. + 1. How will the need for mark and seek in index construction for merge join be handled? In the long term we'd like Hadoop to handle this for us by creating a !SeekableInputFormat that would add this functionality. In the meantime we can extend !FileInputFormat to !PigFileInputFormat. We can add a getPos() call to this class that will provide a position to start reading at to find the tuple being indexed. Note that this position will not necessarily be the exact position of the tuple, but a position from which the tuple can be found. 
We can also change the getSplits call on this class to return a split that is specific to a given position so that it can be used during the join. == Changes == Sept 23 2009, Gates @@ -478, +480 @@ Sept 25 2009, Gates * Added allFinished call to !StoreFunc + Sept 29 2009, Gates + * Added answer for open question 1. Added and answered open questions 2 and 3. +
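The answer to the first open question suggests rewriting a plain filename to a `file://` URI when running in local mode. A minimal sketch of that rewrite, assuming a simple scheme check is sufficient (the class and method names are invented):

```java
// Sketch (assumption): in local mode a plain path is rewritten to a
// file:// URI so InputFormat-based code can address local files.
class LocalPathSketch {
    static String toLocalUri(String path) {
        // leave fully-qualified locations (hdfs://, file://, ...) alone
        if (path.contains("://")) return path;
        return "file://" + path;
    }
}
```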
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=123&rev2=124 ||1103||Merge join only supports Filter, Foreach and Load as its predecessor. Found : || ||1104||Right input of merge-join must implement SamplableLoader interface. This loader doesn't implement it.|| ||1105||Heap percentage / Conversion factor cannot be set to 0 || + ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=124&rev2=125 ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| + ||2201||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=125&rev2=126 ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| - ||2201||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin|| @@ -420, +419 @@ ||4007||Missing from hadoop configuration|| ||4008||Failed to create local hadoop file || ||4009||Failed to copy data to local hadoop file || + ||4010||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||6000||The output file(s): already exists|| ||6001||Cannot read from the storage where the output will be stored|| ||6002||Unable to obtain a temporary path.||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=126&rev2=127 ||1104||Right input of merge-join must implement SamplableLoader interface. This loader doesn't implement it.|| ||1105||Heap percentage / Conversion factor cannot be set to 0 || ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || + ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| @@ -419, +420 @@ ||4007||Missing from hadoop configuration|| ||4008||Failed to create local hadoop file || ||4009||Failed to copy data to local hadoop file || - ||4010||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||6000||The output file(s): already exists|| ||6001||Cannot read from the storage where the output will be stored|| ||6002||Unable to obtain a temporary path.||
[Pig Wiki] Update of "FrontPage" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by yinghe: http://wiki.apache.org/pig/FrontPage?action=diff&rev1=142&rev2=143 * PigErrorHandling * PigMultiQueryPerformanceSpecification * PigSkewedJoinSpec + * PigAccumulatorSpec * PigSampler * Performance * PigPerformance (current performance numbers)
[Pig Wiki] Update of "FrontPage" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by yinghe: http://wiki.apache.org/pig/FrontPage?action=diff&rev1=143&rev2=144 * PigErrorHandling * PigMultiQueryPerformanceSpecification * PigSkewedJoinSpec - * PigAccumulatorSpec + * PigAccumulatorUDF * PigSampler * Performance * PigPerformance (current performance numbers)
[Pig Wiki] Update of "FrontPage" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by yinghe: http://wiki.apache.org/pig/FrontPage?action=diff&rev1=144&rev2=145 * PigErrorHandling * PigMultiQueryPerformanceSpecification * PigSkewedJoinSpec - * PigAccumulatorUDF + * PigAccumulatorSpec * PigSampler * Performance * PigPerformance (current performance numbers)
[Pig Wiki] Update of "MarkMeissonnier" by MarkMeissonnier
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "MarkMeissonnier" page has been changed by MarkMeissonnier: http://wiki.apache.org/pig/MarkMeissonnier New page: #format wiki #language en == Mark Meissonnier == I am a software engineer who arrived in Silicon Valley in March 2000 (which, for the anecdote, was one month after the Nasdaq hit its all-time high of 5000 points... What is it today?) CategoryHomepage
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=127&rev2=128 ||1105||Heap percentage / Conversion factor cannot be set to 0 || ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| + ||1108||Duplicated schema|| - ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || + ||20008||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| ||2003||Cannot read from the storage where the output will be stored||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=128&rev2=129 ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| - ||20008||Internal error. Mismatch in group by arities. Expected: . Found: || + ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| ||2003||Cannot read from the storage where the output will be stored||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=129&rev2=130 ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| + ||2201||Could not validate schema alias|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=7&rev2=8 * to commit information to a metadata system, clean up tmp files, * close connections, etc. This call will be made on the front end * after all back end processing is finished. + * @param conf The job configuration */ - void allFinished(); + void allFinished(Configuration conf);
[Pig Wiki] Update of "PigMix" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by AlanGates. http://wiki.apache.org/pig/PigMix?action=diff&rev1=11&rev2=12 -- || PigMix_12 || 156 || 160.67 || 0.97 || || Total || 2440.67 || 2001.67 || 1.22 || + Run date: October 18, 2009, run against top of trunk as of that day. + With this run we included a new measure, weighted average. The multiplier we have been publishing takes the total time of running all 12 Pig Latin scripts and compares it to the total time of + running all 12 Java Map Reduce programs. This is a valid way to measure, as it shows the total amount of time to do all these operations on both platforms. But it has the drawback that it gives more weight to + long-running operations (such as joins and order bys) while masking the performance in faster operations such as group bys. The new "weighted average" adds up the multiplier for each Pig Latin script vs. Java + program separately and then divides by 12, thus weighting each test equally. In past runs the weighted average had significantly lagged the overall average (for example, in the run above for August 27 it + was 1.5 even though the total difference was 1.2). With this latest run it still lags some, but the gap has shrunk noticeably. + + || Test || Pig run time || Java run time || Multiplier || + || PigMix_1 || 135.0|| 133.0 || 1.02 || + || PigMix_2 || 46.67|| 39.33 || 1.19 || + || PigMix_3 || 184.0|| 98.0 || 1.88 || + || PigMix_4 || 71.67|| 77.67 || 0.92 || + || PigMix_5 || 70.0 || 83.0 || 0.84 || + || PigMix_6 || 76.67|| 61.0 || 1.26 || + || PigMix_7 || 71.67|| 61.0 || 1.17 || + || PigMix_8 || 43.33|| 47.67 || 0.91 || + || PigMix_9 || 184.0|| 209.33|| 0.88 || + || PigMix_10 || 268.67 || 283.0 || 0.95 || + || PigMix_11 || 145.33 || 168.67|| 0.86 || + || PigMix_12 || 55.33|| 95.33 || 0.58 || + || Total || 1352.33 || 1357 || 1.00 || + Weighted Average: 1.04 == Features Tested ==
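The two measures described above can be recomputed from the raw times in the table. A small sketch (times copied from this run; the total multiplier divides the summed runtimes, while the weighted average gives each of the 12 scripts equal weight):

```java
// Sketch: recomputing the two PigMix metrics from the run's raw times.
class PigMixMetrics {
    static final double[] PIG  = {135.0, 46.67, 184.0, 71.67, 70.0, 76.67,
                                  71.67, 43.33, 184.0, 268.67, 145.33, 55.33};
    static final double[] JAVA = {133.0, 39.33, 98.0, 77.67, 83.0, 61.0,
                                  61.0, 47.67, 209.33, 283.0, 168.67, 95.33};

    // total Pig time / total Java time (the previously published number)
    static double totalMultiplier() {
        double p = 0, j = 0;
        for (int i = 0; i < PIG.length; i++) { p += PIG[i]; j += JAVA[i]; }
        return p / j;
    }

    // mean of the per-script ratios, so each test counts equally
    static double weightedAverage() {
        double sum = 0;
        for (int i = 0; i < PIG.length; i++) sum += PIG[i] / JAVA[i];
        return sum / PIG.length;
    }
}
```

This reproduces the published values of 1.00 (total) and 1.04 (weighted average) to two decimals.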
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=8&rev2=9 -- '''!LoadFunc''' {{{ + /** * This interface is used to implement functions to parse records * from a dataset. */ - interface LoadFunc { + public interface LoadFunc { + /** + * This method is called by the Pig runtime in the front end to convert the + * input location to an absolute path if the location is relative. The + * loadFunc implementation is free to choose how it converts a relative + * location to an absolute location since this may depend on what the location + * string represent (hdfs path or some other data source) + * + * @param location location as provided in the "load" statement of the script + * @param curDir the current working direction based on any "cd" statements + * in the script before the "load" statement + * @return the absolute location based on the arguments passed + * @throws IOException if the conversion is not possible + */ + String relativeToAbsolutePath(String location, String curDir) throws IOException; /** * Communicate to the loader the load string used in Pig Latin to refer to the - * object(s) being loaded. That is, if the PL script is - * A = load 'bla' - * then 'bla' is the load string. In general Pig expects these to be - * a path name, a glob, or a URI. If there is no URI scheme present, - * Pig will assume it is a file name. This will be - * called during planning on the front end, not during execution on - * the backend. - * @param location Location indicated in load statement. + * object(s) being loaded. The location string passed to the LoadFunc here + * is the return value of {...@link LoadFunc#relativeToAbsolutePath(String, String)} + * + * This method will be called in the backend multiple times. 
Implementations + * should bear in mind that this method is called multiple times and should + * ensure there are no inconsistent side effects due to the multiple calls. + * + * @param location Location as returned by + * {...@link LoadFunc#relativeToAbsolutePath(String, String)}. + * @param job the {...@link Job} object * @throws IOException if the location is not valid. */ - void setLocation(String location) throws IOException; + void setLocation(String location, Job job) throws IOException; /** + * This will be called during planning on the front end. This is the - * Return the InputFormat associated with this loader. This will be - * called during planning on the front end. The LoadFunc need not - * carry the InputFormat information to the backend, as it will - * be provided with the appropriate RecordReader there. This is the * instance of InputFormat (rather than the class name) because the * load function may need to instantiate the InputFormat in order * to control how it is constructed. + * @return the InputFormat associated with this loader. + * @throws IOException if there is an exception during InputFormat + * construction */ - InputFormat getInputFormat(); + InputFormat getInputFormat() throws IOException; /** + * This will be called on the front end during planning and not on the back + * end during execution. - * Return the LoadCaster associated with this loader. Returning + * @return the {...@link LoadCaster} associated with this loader. Returning null - * null indicates that casts from byte array are not supported + * indicates that casts from byte array are not supported for this loader. - * for this loader. This will be called on the front end during - * planning and not on the back end during execution. + * construction + * @throws IOException if there is an exception during LoadCaster */ - LoadCaster getLoadCaster(); + LoadCaster getLoadCaster() throws IOException; /** * Initializes LoadFunc for reading data. 
This will be called during execution * before any calls to getNext. The RecordReader needs to be passed here because * it has been instantiated for a particular InputSplit. - * @param reader RecordReader to be used by this instance of the LoadFunc + * @param reader {...@link RecordReader} to be used by this instance of the LoadFunc + * @param split The input {...@link PigSplit} to process + * @throws IOException if there is an exception during initialization */ + void prepareToRead(RecordReader reader, PigSplit split) throws IOException; - void prepareToRe
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by ankit.modi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by ankit.modi. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=130&rev2=131 -- ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| + ||1109||Input ( ) on which outer join is desired should have a valid schema|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| @@ -467, +468 @@ 1. February 11, 2009: Updated "Compendium of error messages" to include new error codes (2116 through 2121, 6015 and 6016) 1. February 12, 2009: Updated "Compendium of error messages" to include new error code 2122 1. April 10, 2009: Updated "Compendium of error messages" to replace error code 2110 +1. November 2, 2009: Updated "Compendium of error messages" to include new error code 1109 == References == 1. <> "Pig Developer Cookbook" October 21, 2008, http://wiki.apache.org/pig/PigDeveloperCookbook
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=9&rev2=10 -- * Added relativeToAbsolutePath() method in LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 * Changed comments in setLocation regarding the location passed - the location will now be the return value of relativeToAbsolutePath() * setLocation() now also takes a Job argument since the main purpose of this call is to give the LoadFunc implementation an opportunity to communicate the input location to the underlying InputFormat. InputFormat implementations in turn seem to store this information in the Job. For example, FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths); + * Removed doneReading() method since there is already a RecordReader.close() method which will be called by Hadoop wherein all the functionality that needs to be done on completion of reading can be done. * All methods can now throw IOException - this keeps the interface more flexible for exception cases In LoadMetadata:
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=10&rev2=11 -- * @return the absolute location based on the arguments passed * @throws IOException if the conversion is not possible */ - String relativeToAbsolutePath(String location, String curDir) throws IOException; + String relativeToAbsolutePath(String location, Path curDir) throws IOException; /** * Communicate to the loader the load string used in Pig Latin to refer to the
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=11&rev2=12 -- * * @param location location as provided in the "load" statement of the script * @param curDir the current working direction based on any "cd" statements - * in the script before the "load" statement + * in the script before the "load" statement. If there are no "cd" statements + * in the script, this would be the home directory - + * /user/ * @return the absolute location based on the arguments passed * @throws IOException if the conversion is not possible */ String relativeToAbsolutePath(String location, Path curDir) throws IOException; + /** * Communicate to the loader the load string used in Pig Latin to refer to the
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=12&rev2=13 -- '''!StoreFunc''' {{{ + + /** + * This interface is used to implement functions to write records + * from a dataset. + */ + public interface StoreFunc { + + /** + * This method is called by the Pig runtime in the front end to convert the + * output location to an absolute path if the location is relative. The + * StoreFunc implementation is free to choose how it converts a relative + * location to an absolute location since this may depend on what the location + * string represent (hdfs path or some other data source) + * + * @param location location as provided in the "store" statement of the script + * @param curDir the current working direction based on any "cd" statements + * in the script before the "store" statement. If there are no "cd" statements + * in the script, this would be the home directory - + * /user/ + * @return the absolute location based on the arguments passed + * @throws IOException if the conversion is not possible + */ + String relToAbsPathForStoreLocation(String location, Path curDir) throws IOException; /** * Return the OutputFormat associated with StoreFunc. This will be called * on the front end during planning and not on the backend during - * execution. OutputFormat information need not be carried to the back end - * as the appropriate RecordWriter will be provided to the StoreFunc. + * execution. 
+ * @return the {...@link OutputFormat} associated with StoreFunc + * @throws IOException if an exception occurs while constructing the + * OutputFormat - */ + * + */ - OutputFormat getOutputFormat(); + OutputFormat getOutputFormat() throws IOException; /** * Communicate to the store function the location used in Pig Latin to refer @@ -327, +353 @@ * called during planning on the front end, not during execution on * the backend. * @param location Location indicated in store statement. + * @param job The {...@link Job} object * @throws IOException if the location is not valid. */ - void setLocation(String location) throws IOException; + void setStoreLocation(String location, Job job) throws IOException; /** * Set the schema for data to be stored. This will be called on the + * front end during planning. A Store function should implement this function to - * front end during planning. If the store function wishes to record - * the schema it will need to carry it to the backend. - * Even if a store function cannot - * record the schema, it may need to implement this function to * check that a given schema is acceptable to it. For example, it * can check that the correct partition keys are included; * a storage function to be written directly to an OutputFormat can * make sure the schema will translate in a well defined way. - * @param schema to be checked/set + * @param s to be checked - * @throw IOException if this schema is not acceptable. It should include + * @throws IOException if this schema is not acceptable. It should include * a detailed error message indicating what is wrong with the schema. */ - void setSchema(ResourceSchema s) throws IOException; + void checkSchema(ResourceSchema s) throws IOException; /** * Initialize StoreFunc to write data. This will be called during * execution before the call to putNext. * @param writer RecordWriter to use. 
+ * @throws IOException if an exception occurs during initialization */ - void prepareToWrite(RecordWriter writer); + void prepareToWrite(RecordWriter writer) throws IOException; - - /** - * Called when all writing is finished. This will be called on the backend, - * once for each writing task. - */ - void doneWriting(); /** * Write a tuple the output stream to which this instance was * previously bound. * - * @param f the tuple to store. + * @param t the tuple to store. - * @throws IOException + * @throws IOException if an exception occurs during the write */ void putNext(Tuple t) throws IOException; - - /** - * Called when writing all of the data is finished. This can be used - * to commit information to a metadata system, clean up tmp files, - * close connections, etc. This call will be made on the front end - * after all back end processing is finished. - * @param conf The job configurati
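checkSchema() above is described as validating a schema rather than recording it, for example checking that the correct partition keys are included. A hedged sketch of that kind of check (the real method signals rejection by throwing IOException; this sketch returns a boolean to keep the example minimal, and the field names are invented):

```java
import java.util.Set;

// Sketch (assumption): the kind of validation a checkSchema()
// implementation might perform for a partitioned store.
class SchemaCheckSketch {
    static boolean isSchemaAcceptable(Set<String> schemaFields,
                                      Set<String> partitionKeys) {
        // every partition key must appear as a field in the schema
        return schemaFields.containsAll(partitionKeys);
    }
}
```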
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=13&rev2=14 -- Nov 2 2009, Pradeep Kamath - In LoadFunc: + In !LoadFunc: - * Added relativeToAbsolutePath() method in LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 + * Added relativeToAbsolutePath() method in !LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 * Changed comments in setLocation regarding the location passed - the location will now be the return value of relativeToAbsolutePath() - * setLocation() now also takes a Job argument since the main purpose of this call is to an opportunity to the LoadFunc implementation to communicate the input location to underlying InputFormat. InputFormat implementations inturn seem to be storing this information inthe Job. For example, FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths) ; + * setLocation() now also takes a Job argument since the main purpose of this call is to an opportunity to the !LoadFunc implementation to communicate the input location to underlying !InputFormat. !InputFormat implementations inturn seem to be storing this information inthe Job. For example, !FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths) ; * Removed doneReading() method since there is already a RecordReader.close() method which will be called by Hadoop wherein all the functionality that needs to be done on completion of reading can be done. 
* All methods now can throw IOException - this keeps the interface more flexible for exception cases - In LoadMetadata: + In !LoadMetadata: * getSchema(), getStatistics() and getPartitionKeys() methods now take a location and Configuration argument so that the implementation can use that information in returning the information requested. - In StoreFunc: + In !StoreFunc: - * Added relativeToAbsolutePath() method per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 + * Added relToAbsPathForStoreLocation() method per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 * Methods which did not throw IOException now do so to enable exceptions in implementations - * Removed doneWriting() - same functionality already present in RecordWriter.close() and OutputCommitter.commitTask() + * Removed doneWriting() - same functionality already present in !RecordWriter.close() and !OutputCommitter.commitTask() * Changed setSchema() to checkSchema since this method is called only to allow StoreFunc to check - * Removed allFinished() - same functionality already present in OutputCommitter.cleanupJob() + * Removed allFinished() - same functionality already present in !OutputCommitter.cleanupJob()
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=14&rev2=15 -- }}} - Open questions for !LoadFunc: - 1. Should setLocation instead be setURI and take a URI? The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given). The disadvantage is forcing more structure on users and their load functions. - The '''!LoadCaster''' interface will include bytesToInt, bytesToLong, etc. functions currently in !LoadFunc. UTF8!StorageConverter will implement this interface. + + Open Question: Should the methods to convert to a Bag, Tuple and Map take a Schema (ResourceSchema?) argument? + '''!LoadMetadata''' {{{
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=15&rev2=16 -- The '''!LoadCaster''' interface will include bytesToInt, bytesToLong, etc. functions currently in !LoadFunc. UTF8!StorageConverter will implement this interface. - Open Question: Should the methods to convert to a Bag, Tuple and Map take a Schema (ResourceSchema?) argument? + '''Open Question''': Should the methods to convert to a Bag, Tuple and Map take a Schema (ResourceSchema?) argument? '''!LoadMetadata''' @@ -425, +425 @@ result. Since Pig still needs to add information to !InputSplits, user provided !InputFormats and !InputSplits cannot be used directly. Instead, the - proposal is to change !PigInputFormat to contain an !InputFormat. !PigInputFormat will return !PigInputSplits, each of which contain an + proposal is to change !PigInputFormat to represent the job's !InputFormat to !Hadoop and internally to handle the complexity of multiple inputs and hence multiple !InputFormats. !PigInputFormat will return !PigSplits each of which contain an - !InputSplit. In addition, !PigInputSplit will contain the necessary information to allow Pig to correctly address tuples to the correct data + !InputSplit. In addition, !PigSplit will contain the necessary information to allow Pig to correctly address tuples to the correct data processing pipeline. - In order to support arbitrary Hadoop !InputFormats, it will be necessary to construct a load function, !InputFormatLoader, that will take an + In order to support arbitrary Hadoop !InputFormats, Pig can provide a load function, !InputFormatLoader, that will take an - !InputFormat as a constructor argument. When asked by Pig which !InputFormat to use, it will return the one indicated by the user. Its call to + !InputFormat as a constructor argument. 
Only !InputFormats which have zero argument constructors can be supported since Pig will try to instantiate the supplied !InputFormat using reflection. When asked by Pig which !InputFormat to use, it will return the one indicated by the user. Its call to getNext will then take the key and value provided by the associated !RecordReader and construct a two field tuple. These types will be converted to Pig types as follows: @@ -445, +445 @@ || !BooleanWritable || int|| In the future if Pig exposes boolean as a first class type, this would change to boolean || || !ByteWritable|| int|| || || !NullWritable|| null || || - || All others || byte array || || + || All others || byte array || How do we construct a byte array from arbitrary types? || Since the format of any other types are unknown to Pig and cannot be generalized, it does not make sense to provide casts from byte array to pig types via a !LoadCaster. If users wish to use an !InputFormat that uses types beyond these and cast them to Pig types, they can extend the @@ -469, +469 @@ Positioning information in an !InputSplit presents a problem. Hadoop 0.18 has a getPos call in the !InputSplit, but it has been removed in 0.20. The reason is that input from files can generally be assigned a position, though it may not always be - accurate, as in the bzip case. But some input formats position may not have meaning. Even if Pig does not switch to using !InputFormats it will + accurate, as in the bzip case. But for some input formats position may not have meaning. Even if Pig does not switch to using !InputFormats it will have to deal with this issue, just as MR has. + These changes will affect the !SamplableLoader interface. Currently it uses skip and getPos to move the underlying stream so that it can pick + up a sample of tuples out of a block. Since it would sit atop !InputFormat it would no longer have access to the underlying stream. It would be + changed instead to skip a number of tuples. 
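The !InputFormatLoader idea above - instantiate the user-named !InputFormat by reflection (hence the zero-argument-constructor requirement) and turn each key/value pair handed over by the !RecordReader into a two-field tuple - can be sketched in plain Java. This is only an illustration of the pattern, not Pig code: the class `InputFormatLoaderSketch`, the stub `TextInputFormatStub`, and the use of `List<Object>` in place of Pig's Tuple are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an InputFormatLoader-style load function.
// A plain List<Object> stands in for Pig's Tuple, and TextInputFormatStub
// stands in for a user-supplied InputFormat with a zero-arg constructor.
public class InputFormatLoaderSketch {

    // Stand-in for a user-supplied InputFormat; only its zero-argument
    // constructor matters for this sketch.
    public static class TextInputFormatStub {
        @Override
        public String toString() { return "TextInputFormatStub"; }
    }

    private final Object inputFormat;

    // The loader receives the InputFormat's class name as a constructor
    // argument and instantiates it by reflection -- which is exactly why
    // only zero-argument constructors can be supported.
    public InputFormatLoaderSketch(String inputFormatClassName) {
        try {
            this.inputFormat = Class.forName(inputFormatClassName)
                                    .getDeclaredConstructor()
                                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(
                "InputFormat must have a zero-arg constructor", e);
        }
    }

    // When asked which InputFormat to use, return the user's choice.
    public Object getInputFormat() { return inputFormat; }

    // getNext() takes the key and value provided by the associated
    // RecordReader and constructs a two-field tuple.
    public List<Object> getNext(Object key, Object value) {
        return new ArrayList<>(Arrays.asList(key, value));
    }

    public static void main(String[] args) {
        InputFormatLoaderSketch loader = new InputFormatLoaderSketch(
            "InputFormatLoaderSketch$TextInputFormatStub");
        List<Object> tuple = loader.getNext(0L, "first line of input");
        System.out.println(loader.getInputFormat() + " -> " + tuple);
    }
}
```

In the real proposal the two fields would then be mapped to Pig types per the conversion table above; the sketch stops at tuple construction.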
+ However, in some places Pig needs this position information. In particular, when building an index for a merge join, Pig needs a way to mark a + location in an input while building the index and then return to that position during the join. In this new proposal, the merge join index will contain filename and split index (index of the split in the List returned by InputFormat.getSplits()). The merge join code at run time will then seek to the right split in the file and process from that split on. For this to work th
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=16&rev2=17 -- 1. How will we worked with compressed files? !FileInputFormat already works with bzip and gzip compressed files, producing reasonable splits. !PigStorage will be reworked to depend on !FileInputFormat (or a descendant thereof, see next item) and should therefore be able to use this functionality. Currently Pig supports gz/bzip for arbitrary loadfunc/storefunc combinations. With this proposal, gz/bzip format will only be supported for load/store using PigStorage. - === Implementation details and status === + == Implementation details and status == - Current status + === Current status === A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for PigStorage and BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. - Notes on implementation details + === Notes on implementation details === + This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. 
+ Changes to work with Hadoop !InputFormat model + + Changes to work with Hadoop !OutputFormat model + - Remaining Tasks + === Remaining Tasks === - * BinStorage needs to implement LoadMetadata's getSchema() to replace current determineSchema() + * !BinStorage needs to implement !LoadMetadata's getSchema() to replace current determineSchema() * piggybank loaders/storers need to be ported - * fix lineage code to use LoadCaster instead of LoadFunc + * fix lineage code to use !LoadCaster instead of !LoadFunc * local mode needs to be ported - * PigDump needs to be ported + * !PigDump needs to be ported - * poload needs to be ported + * !POLoad needs to be ported * Need to handle passing loadfunc specific info between different instances of loadfunc (Different instances in front end and between front end and back end - we need what is required in PIG-602) (setPartitionFilter() and pushOperators()for example needs this - these methods are called in the front end but the information passed is needed in the backend) - * For ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with + * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here. * Input/Output handler code in streaming needs to be ported * split by file will have to removed from language * fix code with FIXME in comment relating to load-store redesign - * Decide on what we should do with ReversibleLoadFunc and multiquery optimization + * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=17&rev2=18 -- === Notes on implementation details === This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. - Changes to work with Hadoop !InputFormat model + Changes to work with Hadoop InputFormat model + Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. Changes to work with Hadoop !OutputFormat model @@ -530, +531 @@ * fix lineage code to use !LoadCaster instead of !LoadFunc * local mode needs to be ported * !PigDump needs to be ported - * !POLoad needs to be ported + * POLoad needs to be ported + * Need to handle passing loadfunc specific info between different instances of loadfunc (Different instances in front end and between front end and back end - we need what is required in PIG-602) (setPartitionFilter() and pushOperators()for example needs - * Need to handle passing loadfunc specific info between different instances of loadfunc (Different instances in front end and - between front end and back end - we need what is required in PIG-602) (setPartitionFilter() and pushOperators()for example needs this - these methods are called in the front end but the information passed is needed in the backend) + * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here. 
- * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with - schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here. * Input/Output handler code in streaming needs to be ported * split by file will have to removed from language * fix code with FIXME in comment relating to load-store redesign * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization - - == Changes == Sept 23 2009, Gates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=24&rev2=25 -- * invoke !LoadFunc.setLocation() * Call getInputFormat() on the !LoadFunc and then createRecordReader() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the createRecordReader() call and the createRecordReader() call needs to be given a !TaskAttemptContext built out of the "updated (with location)" Configuration. * Wrap the !RecordReader returned above in !PigRecordReader class which is returned to Hadoop as the !RecordReader. !PigRecordReader has Text as key type (which is always sent with a null value to Hadoop since in pig, we really do not extract a key from input records) and a Tuple as a the value type (which is a tuple constructed from the input record). + + '''Open Question''': - We are hoping that !LoadFunc actually sets up the input location on the conf in the setLocation() call - and then using that conf in createRecordReader() call - what if it does this in getInputFormat()? Changes to work with Hadoop OutputFormat model Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is the class indicated by Pig as the !OutputFormat for map reduce jobs compiled from pig scripts.
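The !PigRecordReader wrapping described above - the key handed to Hadoop is always null because Pig does not extract a key from input records, and the value is a tuple built from the record - can be sketched as follows. Everything here is a stand-in invented for illustration: `List<Object>` plays the part of Pig's Tuple and a plain `Iterator<String>` plays the part of the wrapped Hadoop !RecordReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the PigRecordReader wrapping: the wrapped reader's
// record becomes the Tuple value, while the key returned to Hadoop is
// always null.
public class PigRecordReaderSketch {

    private final Iterator<String> underlying; // stand-in for the wrapped RecordReader
    private List<Object> currentTuple;

    public PigRecordReaderSketch(Iterator<String> underlying) {
        this.underlying = underlying;
    }

    // Advance the underlying reader and build the value tuple from its record.
    public boolean nextKeyValue() {
        if (!underlying.hasNext()) return false;
        currentTuple = new ArrayList<>(Arrays.asList((Object) underlying.next()));
        return true;
    }

    // Pig does not extract a key from input records, so the key is always null.
    public Object getCurrentKey() { return null; }

    public List<Object> getCurrentValue() { return currentTuple; }

    public static void main(String[] args) {
        PigRecordReaderSketch r = new PigRecordReaderSketch(
            Arrays.asList("record one", "record two").iterator());
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentKey() + " -> " + r.getCurrentValue());
        }
    }
}
```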
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=25&rev2=26 -- * split by file will have to removed from language * fix code with FIXME in comment relating to load-store redesign * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization + * Address any '''Open Question'''s in this document == Changes == Sept 23 2009, Gates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=21&rev2=22 -- Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In !PigInputFormat.getSplits(), the implementation processes each input in the following manner: * Instantiate the !LoadFunc associated with the input - * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. + * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputPaths(Job job, String location). We don't want updates to the Configuration for different inputs to over-write each other - hence the clone. 
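The clone-per-input reasoning above can be illustrated with a toy sketch. Here `java.util.Properties` stands in for Hadoop's Configuration, and the key name `mapred.input.dir` merely mimics what !FileInputFormat.setInputPaths() would record; none of this is actual Pig or Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of why PigInputFormat.getSplits() clones the job
// Configuration per input: each input's setLocation() writes into its own
// clone, so inputs in the same job never overwrite each other's settings.
public class CloneConfSketch {

    // One input communicates its location into a *clone* of the job
    // configuration (playing the role of LoadFunc.setLocation()).
    static Properties setLocationOnClone(Properties jobConf, String location) {
        Properties clone = new Properties();
        clone.putAll(jobConf);                           // clone the job conf
        clone.setProperty("mapred.input.dir", location); // record the location
        return clone;
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        List<Properties> perInput = new ArrayList<>();
        // Two inputs of the same map reduce job, e.g. the two sides of a join.
        for (String loc : new String[] {"/data/users", "/data/pages"}) {
            perInput.add(setLocationOnClone(jobConf, loc));
        }
        // Each input sees only its own location; the shared conf is untouched.
        System.out.println(perInput.get(0).getProperty("mapred.input.dir"));
        System.out.println(perInput.get(1).getProperty("mapred.input.dir"));
        System.out.println(jobConf.getProperty("mapred.input.dir"));
    }
}
```

Had both inputs written into the shared `jobConf`, the second location would have clobbered the first - which is the overwrite problem the cloning avoids.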
* Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the getSplits() call and the getSplits() call needs to be given a !JobContext built out of the "updated (with location)" cloned Configuration. * Wrap each returned !InputSplit in !PigSplit to store information like the list of target operators (the pipeline) for this input, the index of the split in the List of Splits returned by getSplits (this is used during merge join index creation) etc (comments in PigSplit explain the members) @@ -535, +535 @@ * Instantiate the !LoadFunc associated with input represented by the PigSplit passed into !PigInputFormat.createRecordReader() * invoke !LoadFunc.setLocation() * Call getInputFormat() on the !LoadFunc and then createRecordReader() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the createRecordReader() call and the createRecordReader() call needs to be given a !TaskAttemptContext built out of the "updated (with location)" Configuration. + * Wrap the !RecordReader returned above in !PigRecordReader class which is returned to Hadoop as the !RecordReader. !PigRecordReader has Text as key type (which is always sent with a null value to Hadoop since in pig, we really do not extract a key from input records) and a Tuple as a the value type (which is a tuple constructed from the input record). Changes to work with Hadoop OutputFormat model + Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is the class indicated by Pig as the !OutputFormat for map reduce jobs compiled from pig scripts. 
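Since Hadoop allows a single !OutputFormat per job while a Pig job may contain several stores, one natural shape for the coordination - the shape this proposal describes for !PigOutputCommitter, which keeps a list of committers and delegates each call to all of them - is a fan-out delegate. The sketch below is hypothetical: `Committer`, `LoggingCommitter`, and `PigStyleCommitter` are invented stand-ins, not Hadoop's OutputCommitter API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the delegation pattern described for
// PigOutputCommitter: one committer fans each lifecycle call out to the
// committers of the underlying OutputFormats, one per store.
public class DelegatingCommitterSketch {

    public interface Committer { void commitTask(String taskId); }

    // Records each call so the delegation is observable in the sketch.
    public static class LoggingCommitter implements Committer {
        private final String name;
        private final List<String> log;
        public LoggingCommitter(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }
        public void commitTask(String taskId) { log.add(name + ":" + taskId); }
    }

    // Delegates every call to all underlying committers.
    public static class PigStyleCommitter implements Committer {
        private final List<Committer> delegates;
        public PigStyleCommitter(List<Committer> delegates) {
            this.delegates = delegates;
        }
        public void commitTask(String taskId) {
            for (Committer c : delegates) c.commitTask(taskId);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Committer> perStore = new ArrayList<>();
        perStore.add(new LoggingCommitter("store1", log));
        perStore.add(new LoggingCommitter("store2", log));
        new PigStyleCommitter(perStore).commitTask("attempt_0");
        System.out.println(log); // one entry per store's committer
    }
}
```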
+ + In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over POStore(s) in the map and reduce phases and for each such store does the following: + * Instantiate the !StoreFunc associated with the POStore + * Make a clone of the JobContext passed in !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary here is because generally in the setStorelocation() method, the StoreFunc would communicate the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call. For example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates to the Configuration for different outputs to over-write each other - hence the clone. + * Call getOutputFormat() on the !StoreFunc and then checkOutputSpecs() on the !OutputFormat returned. Note that the above setStoreLocation call needs to happen *before* the checkOutputSpecs() call and the checkOutputSpecs() call needs to be given the "updated (with location)" cloned JobContext. + === Remaining Tasks === * !BinStorage needs to implement !LoadMetadata's getSchema() to repl
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=22&rev2=23 -- Changes to work with Hadoop OutputFormat model Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is the class indicated by Pig as the !OutputFormat for map reduce jobs compiled from pig scripts. - In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over POStore(s) in the map and reduce phases and for each such store does the following: + In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over !POStore(s) in the map and reduce phases and for each such store does the following: - * Instantiate the !StoreFunc associated with the POStore + * Instantiate the !StoreFunc associated with the !POStore - * Make a clone of the JobContext passed in !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary here is because generally in the setStorelocation() method, the StoreFunc would communicate the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call. For example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates to the Configuration for different outputs to over-write each other - hence the clone. + * Make a clone of the !JobContext passed in !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary here is because generally in the setStorelocation() method, the !StoreFunc would communicate the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call. 
For example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates to the Configuration for different outputs to over-write each other - hence the clone. * Call getOutputFormat() on the !StoreFunc and then checkOutputSpecs() on the !OutputFormat returned. Note that the above setStoreLocation call needs to happen *before* the checkOutputSpecs() call and the checkOutputSpecs() call needs to be given the "updated (with location)" cloned JobContext. + + !PigOutputFormat.getOutputCommitter() returns a !PigOutputCommitter object. The !PigOutputCommitter internally keeps a list of OutputCommitters corresponding to !OutputFormat of !StoreFunc(s) in the POStore(s) in the map and reduce phases. It delegates all calls in the OutputCommitter class invoked by Hadoop to calls on the appropriate underlying committers. + + The other method in !OutputFormat is the getRecordWriter() method. In the single store case !PigOutputFormat.getRecordWriter() does the following: + * Instantiate the !StoreFunc associated with single !POStore. + * invoke !StoreFunc.setStoreLocation() + * Call getOutputFormat() on the !StoreFunc and then getRecordWriter() on the !OutputFormat returned. Note that the above setStoreLocation call needs to happen *before* the getRecordWriter() call and the getRecordWriter() call needs to be given a !TaskAttemptContext which has the "updated (with location)" Configuration. + * Wrap the !RecordWriter returned above in !PigRecordWriter class which is returned to Hadoop as the !RecordWriter. !PigRecordReader has WritableComparable as key type (which is always sent with a null value when we write, since in pig, we really do not have a key to store in the output( and a Tuple as a the value type (which is the output tuple). + + For the multi query optimized multi store case, there are multiple !POStores in the same map reduce job. 
In this case, the data is written out in the Pig map or reduce pipeline itself through the POStore operator. Details of this can be found in http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - "Internal Changes" section - "Store Operator" subsection. So from the pig runtime code, we never call Context.write() (which would have internally called PigRecordWriter.write()). So the handling of multi stores has not changed for writing data out for this redesign. === Remaining Tasks ===
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=19&rev2=20 -- == Implementation details and status == === Current status === - A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for PigStorage and BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. + A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. === Notes on implementation details === This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. Changes to work with Hadoop InputFormat model - Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. 
+ Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In PigInputFormat.getSplits(), the implementation processes each input in the following manner: + + * Instantiate the LoadFunc associated with the input + * Make a clone of the Configuration passed in the getSplits() call and then invoke LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. + * Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the getSplits() call and the getSplits() call needs to be given a !JobContext built out of the "updated (with location)" cloned Configuration. + * Wrap each returned !InputSplit in !PigSplit to store information like the list of target operators (the pipeline) for this input, the index of the split in the List of Splits returned by getSplits (this is used during merge join index creation) etc (comments in PigSplit explain the members) + + The list of target operators helps pig give the tuples from an input to the correct part of the pipeline in a multi input pipeline (like in join, cogroup, union). + + The other method in !InputFormat is createRecordReader which needs be given a !TaskAttemptContext. 
The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above LoadFunc.setLocation() call. However the !PigInputFormat.getSplits() method is called in the front-end by Hadoop and !PigInputFormat.createRecordReader() is called in the back-end. So we would need to somehow pass a Map between input and the input specific Configuration (updated with location and other information from the relevant LoadFunc.setLocation() call) from the front end to the back-end. One way to pass this map would be in the Configuration of the !JobContext passed to !PigInputFormat.getSplits(). However in Hadoop 0.20.1 this Configuration present in the !JobContext passed to !PigInputFormat.getSplits() is a copy of the Configuration which is serialized to the backend and used to create the !TaskAttemptContext passed in !PigInputFormat.createRecordReader(). Hence passing the map this way is not p
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=20&rev2=21 -- This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. Changes to work with Hadoop InputFormat model - Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In PigInputFormat.getSplits(), the implementation processes each input in the following manner: + Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In !PigInputFormat.getSplits(), the implementation processes each input in the following manner: - * Instantiate the LoadFunc associated with the input + * Instantiate the !LoadFunc associated with the input - * Make a clone of the Configuration passed in the getSplits() call and then invoke LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. 
+ * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. * Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the getSplits() call and the getSplits() call needs to be given a !JobContext built out of the "updated (with location)" cloned Configuration. * Wrap each returned !InputSplit in !PigSplit to store information like the list of target operators (the pipeline) for this input, the index of the split in the List of Splits returned by getSplits (this is used during merge join index creation) etc (comments in PigSplit explain the members) The list of target operators helps pig give the tuples from an input to the correct part of the pipeline in a multi input pipeline (like in join, cogroup, union). - The other method in !InputFormat is createRecordReader which needs be given a !TaskAttemptContext. The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above LoadFunc.setLocation() call. However the !PigInputFormat.getSplits() method is called in the front-end by Hadoop and !PigInputFormat.createRecordReader() is called in the back-end. So we would need to somehow pass a Map between input and the input specific Configuration (updated with location and other information from the relevant LoadFunc.setLocation() call) from the front end to the back-end. 
One way to pass this map would be in the Configuration of the !JobContext passed to !PigInputFormat.getSplits(). However in Hadoop 0.20.1 this Configuration present in the !JobContext passed to !PigInputFormat.getSplits() is a copy of the Configuration which is serialized to the backend and used to create the !TaskAttemptContext passed in !PigInputFormat.createRecordReader(). Hence passing the map this way is not possible. Hence we re-create the side effects of the !LoadFunc.setLocation() call in !PigInputFormat.getSplits() in !PigInputFormat.createRecordReader() by the following sequence: + The other method in !InputFormat is createRecordReader which needs be given a !TaskAttemptContext. The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above !LoadFunc.setLocation() call. However the !PigInputFormat.getSplits() method is cal
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=23&rev2=24 -- == Implementation details and status == === Current status === + https://issues.apache.org/jira/browse/PIG-966 is the main JIRA to track progress. A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. + - A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. + Status on Nov 2. 2009: This branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. === Notes on implementation details === This section is to document changes made at a high level to give an overall connected picture which code comments may not provide.
[Pig Wiki] Trivial Update of "PigMix" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by DmitriyRyaboy. The comment on this change is: added back-dated weighted averages. http://wiki.apache.org/pig/PigMix?action=diff&rev1=12&rev2=13 -- || L12 multi-store || 150 || fails|| 781 || 499 || 804 || || Total time || 1791 || 13638|| 4420 || 3284 || 2950|| || Compared to hadoop || 1.0 || 7.6 || 2.5 || 1.8 || 1.6 || + || Weighted Average || 1.0 || 11.2 || 3.26 || 2.20 || 1.97|| The totb run of 1/20/09 includes the change to make !BufferedPositionedInputStream use a buffer instead of relying on hadoop to buffer. @@ -60, +61 @@ || L12 multi-store || 139 || 159 || || Total time || 1826 || 2764|| || Compared to hadoop || N/A || 1.5 || - + || Weighted average || N/A || 1.83|| Run date: June 28, 2009, run against top of trunk as of that day. Note that the columns got reversed in this one (Pig then MR) - || Test || Pig run time || Java run time || Multiplier || + || Test || Pig run time || Java run time || Multiplier || || PigMix_1 || 204 || 117.33 || 1.74 || || PigMix_2 || 110.33 || 50.67 || 2.18 || || PigMix_3 || 292.33 || 125 || 2.34 || @@ -79, +80 @@ || PigMix_11 || 206.33 || 136.67 || 1.51 || || PigMix_12 || 173 || 161.67 || 1.07 || || Total || 2729.67 || 1948.33 || 1.40 || + || Weighted avg || || || 1.68 || Run date: August 27, 2009, run against top of trunk as of that day. @@ -96, +98 @@ || PigMix_11 || 180 || 121 || 1.49 || || PigMix_12 || 156 || 160.67 || 0.97 || || Total || 2440.67 || 2001.67 || 1.22 || + || Weighted avg || || || 1.53 || Run date: October 18, 2009, run against top of trunk as of that day. With this run we included a new measure, weighted average. 
The multiplier we have been publishing takes the total time of running all 12 Pig Latin scripts and compares it to the total time of @@ -118, +121 @@ || PigMix_11 || 145.33 || 168.67|| 0.86 || || PigMix_12 || 55.33|| 95.33 || 0.58 || || Total || 1352.33 || 1357 || 1.00 || - Weighted Average: 1.04 + || Weighted avg || || || 1.04 || == Features Tested ==
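The difference between the two published numbers can be sketched numerically. The exact PigMix weighting formula is not given in this excerpt, so this is only a hedged illustration: it assumes the "weighted average" gives each script's multiplier equal weight, while the older published multiplier is a ratio of totals that long-running scripts dominate. The runtimes below are made up, not actual PigMix measurements.

```java
// Hypothetical runtimes in seconds; NOT actual PigMix numbers.
public class MultiplierSketch {

    // The older published multiplier: ratio of total Pig time to total
    // hand-written MapReduce time. Long-running scripts dominate it.
    public static double totalsRatio(double[] pig, double[] mr) {
        double p = 0, m = 0;
        for (int i = 0; i < pig.length; i++) { p += pig[i]; m += mr[i]; }
        return p / m;
    }

    // A per-script average: every script's multiplier counts equally,
    // regardless of how long the script runs.
    public static double perScriptMean(double[] pig, double[] mr) {
        double sum = 0;
        for (int i = 0; i < pig.length; i++) sum += pig[i] / mr[i];
        return sum / pig.length;
    }

    public static void main(String[] args) {
        double[] pig = {400, 50};   // Pig runtimes for two scripts
        double[] mr  = {200, 100};  // hand-written MapReduce runtimes
        System.out.println(totalsRatio(pig, mr));   // 1.5: dominated by script 1
        System.out.println(perScriptMean(pig, mr)); // 1.25: script 2's win counts
    }
}
```

With the same inputs the two metrics disagree (1.5 vs 1.25), which is why a second number was worth publishing.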
[Pig Wiki] Update of "PigTalksPapers" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTalksPapers" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/PigTalksPapers?action=diff&rev1=7&rev2=8 -- * Pig: Making Hadoop Easy, talk at !ApacheCon US 2008: [[http://wiki.apache.org/pig/ApacheConUS2008|ApacheConUS2008]] * Pig: Making Hadoop Easy, talk at !ApacheCon EU 2009: [[attachment:ApacheConEurope09.ppt|ApacheConEU2009]] * Pig talk given at 2009 Hadoop Summit [[attachment:HadoopSummit2009.ppt|HadoopSummit2009]] + + * Pig usage at Twitter, a presentation from NoSQL East [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|slides]] + * Pig talk for Pittsburgh HUG: intro, explanation of joins, research ideas [[http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/|slides]] == Pig Papers == * Pig paper at VLDB 2009: [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|pdf]]
New attachment added to page PigTalksPapers on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigTalksPapers" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: apacheconus2009.pptx Attachment size: 337661 Attachment link: http://wiki.apache.org/pig/PigTalksPapers?action=AttachFile&do=get&target=apacheconus2009.pptx Page link: http://wiki.apache.org/pig/PigTalksPapers
New attachment added to page PigTalksPapers on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigTalksPapers" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: vldb_presentation.pptx Attachment size: 351814 Attachment link: http://wiki.apache.org/pig/PigTalksPapers?action=AttachFile&do=get&target=vldb_presentation.pptx Page link: http://wiki.apache.org/pig/PigTalksPapers
[Pig Wiki] Update of "PigTalksPapers" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTalksPapers" page has been changed by AlanGates. http://wiki.apache.org/pig/PigTalksPapers?action=diff&rev1=8&rev2=9 -- * Pig: Making Hadoop Easy, talk at !ApacheCon US 2008: [[http://wiki.apache.org/pig/ApacheConUS2008|ApacheConUS2008]] * Pig: Making Hadoop Easy, talk at !ApacheCon EU 2009: [[attachment:ApacheConEurope09.ppt|ApacheConEU2009]] * Pig talk given at 2009 Hadoop Summit [[attachment:HadoopSummit2009.ppt|HadoopSummit2009]] - * Pig usage at Twitter, a presentation from NoSQL East [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|slides]] * Pig talk for Pittsburgh HUG: intro, explanation of joins, research ideas [[http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/|slides]] + * Pig talk at !ApacheCon US 2009: [[attachment:apacheconus2009.pptx|slides]] == Pig Papers == - * Pig paper at VLDB 2009: [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|pdf]] + * Pig paper at VLDB 2009: [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|pdf]], [[attachment:vldb_presentation.pptx|slides]] from the associated talk. * Pig Latin paper at SIGMOD 2008: [[http://infolab.stanford.edu/~olston/publications/sigmod08.pdf|pdf]] * Pig optimization paper at USENIX 2008: [[http://infolab.stanford.edu/~olston/publications/usenix08.pdf|pdf]]
[Pig Wiki] Update of "PiggyBank" by FlipKromer
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PiggyBank" page has been changed by FlipKromer. The comment on this change is: More detail on the CLASSPATH -- need to have hadoop and commons-logging jars in there too. http://wiki.apache.org/pig/PiggyBank?action=diff&rev1=13&rev2=14 -- = Piggy Bank - User Defined Pig Functions = - This is a place for Pig users to share their functions. The functions are contributed "as-is". If you find a bug or if you feel a function is missing, take the time to fix it or write it yourself and contribute the changes. <> + == Using Functions == - To see how to use your own functions in a pig script, please see the [[http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm|Pig Latin Reference Manual]]. Note that only JAVA functions are supported at this time. The functions are currently distributed in source form. Users are required to check out the code and build the package themselves. No binary distributions or nightly builds are available at this time. @@ -14, +13 @@ To build a jar file that contains all available user defined functions (UDFs), please follow these steps: 1. Checkout UDF code: `svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank` - 2. Add pig.jar to your ClassPath : `export CLASSPATH=$CLASSPATH:/path/to/pig.jar` + 1. Add pig.jar to your ClassPath : `export CLASSPATH=$CLASSPATH:/path/to/pig.jar` - 3. Build the jar file: from `trunk/contrib/piggybank/java` directory run `ant`. This will generate `piggybank.jar` in the same directory. + 1. Build the jar file: from the `trunk/contrib/piggybank/java` directory run `ant`. This will generate `piggybank.jar` in the same directory. + Make sure your classpath includes the hadoop jars as well.
This worked for me using the Cloudera CDH2 / Hadoop AMIs: + {{{ + pig_version=0.4.99.0+10 ; pig_dir=/usr/lib/pig ; + hadoop_version=0.20.1+152 ; hadoop_dir=/usr/lib/hadoop ; + export CLASSPATH=$CLASSPATH:${hadoop_dir}/hadoop-${hadoop_version}-core.jar:${hadoop_dir}/hadoop-${hadoop_version}-tools.jar:${hadoop_dir}/hadoop-${hadoop_version}-ant.jar:${hadoop_dir}/lib/commons-logging-1.0.4.jar:${pig_dir}/pig-${pig_version}-core.jar + }}} To obtain a `javadoc` description of the functions run `ant javadoc` from the `trunk/contrib/piggybank/java` directory. The documentation is generated in the `trunk/contrib/piggybank/java/build/javadoc` directory. - + To use a function, you need to figure out which package it belongs to. The top level packages correspond to the function type and currently are: - * org.apache.pig.piggybank.comparison - for custom comparator used by ORDER operator + * org.apache.pig.piggybank.comparison - for custom comparator used by ORDER operator - * org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations + * org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations - * org.apache.pig.piggybank.filtering - for functions used in FILTER operator + * org.apache.pig.piggybank.filtering - for functions used in FILTER operator - * org.apache.pig.piggybank.grouping - for grouping functions + * org.apache.pig.piggybank.grouping - for grouping functions - * org.apache.pig.piggybank.storage - for load/store functions + * org.apache.pig.piggybank.storage - for load/store functions (The exact package of the function can be seen in the javadocs or by navigating the source tree.)
@@ -37, +42 @@ TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; STORE TweetsInaug INTO 'meta/inaug/tweets_inaug' ; }}} - - == Contributing Functions == - For details on how to create UDFs, please, see the [[http://wiki.apache.org/pig/UDFManual|UDF Manual]]. Note that only JAVA functions are supported at this time. To contribute a new function, please, follow the steps: 1. Check existing javadoc to make sure that the function does not already exist as described in [[#Using_Functions]] - 2. Checkout UDF code as described in [[#Using_Functions]] + 1. Checkout UDF code as described in [[#Using_Functions]] - 3. Place your java code in the directory that makes sense for your function. The directory structure as of now has two levels: function type as described in [[#Using_Functions]] and function subtype (like math or string for eval functions) for some of the types. If you feel that your function requires a new subtype, feel free to add one. + 1. Place your java code in the directory that makes sense for your function. The directory structure as of now has two levels: function type as described in [[#Using_Functions]] and function subtype (like math or string for eval functions) for some of the types. If you feel t
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=131&rev2=132 -- ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| ||2201||Could not validate schema alias|| + ||2202||Error change distinct/sort to use secondary key optimizer|| + ||2203||Sort on columns from different inputs|| + ||2204||Error setting secondary key plan|| + ||2205||Error visiting POForEach inner plan|| + ||2206||Error visiting POSort inner plan|| + ||2207||POForEach inner plan has more than 1 root|| + ||2208||Exception visiting foreach inner plan|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Trivial Update of "LoadStoreRedesignProposal" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=28&rev2=29 -- // Probably more in here } - public long mBytes; // size in megabytes + public long mBytes; // "disk" size in megabytes (file size or equivalent) public long numRecords; // number of records public ResourceFieldStatistics[] fields; @@ -608, +608 @@ Added a new section 'Implementation details and status' + Nov 11, Dmitriy Ryaboy + Minor clarification of meaning of mBytes in !ResourceStatistics +
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=29&rev2=30 -- /** - * Communicate to the loader the load string used in Pig Latin to refer to the - * object(s) being loaded. The location string passed to the LoadFunc here + * Communicate to the loader the location of the object(s) being loaded. + * The location string passed to the LoadFunc here is the return value of - * is the return value of {...@link LoadFunc#relativeToAbsolutePath(String, String)} + * {...@link LoadFunc#relativeToAbsolutePath(String, String)} * * This method will be called in the backend multiple times. Implementations * should bear in mind that this method is called multiple times and should
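The revised javadoc above stresses that setLocation() is called multiple times on the backend, so implementations should tolerate repeated calls. A minimal standalone sketch of that pattern follows; it does not extend Pig's real LoadFunc, and every name other than setLocation() (the field names, the setup counter) is made up for illustration.

```java
// Sketch of an idempotent setLocation(): repeated backend calls with
// the same location must not redo expensive one-time setup work.
public class SketchLoader {
    private String location;     // absolute location, as produced by
                                 // relativeToAbsolutePath() on the front end
    private boolean configured;  // guards the one-time setup
    public int setupCount;       // exposed only so the example is checkable

    public void setLocation(String loc) {
        if (configured && loc.equals(location)) {
            return;              // repeated backend call: nothing to redo
        }
        location = loc;
        configured = true;
        setupCount++;            // stand-in for expensive setup work
    }

    public static void main(String[] args) {
        SketchLoader l = new SketchLoader();
        l.setLocation("hdfs://nn/data/in");
        l.setLocation("hdfs://nn/data/in"); // called again on the backend
        System.out.println(l.setupCount);   // 1: setup ran only once
    }
}
```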
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=30&rev2=31 -- } public ResourceFieldSchema[] fields; - public Map byName; enum Order { ASCENDING, DESCENDING } public int[] sortKeys; // each entry is an offset into the fields array.
[Pig Wiki] Update of "LoadStoreRedesignProposal" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by ThejasNair. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=31&rev2=32 -- Mechanism to read side files Pig needs to read side files in many places like in Merge Join, Order by, Skew join, dump etc. To facilitate doing this in an easy manner, a utility !LoadFunc called !ReadToEndLoader has been introduced. Though this has been implemented as a !LoadFunc, the only !LoadFunc method which is truly implemented is getNext(). The usage pattern is to construct an instance using the constructor which takes a reference to the true !LoadFunc (which can read the side file data) and then repeatedly call getNext() till null is encountered in the return value. The implementation of !ReadToEndLoader hides the actions of getting !InputSplits from the underlying !InputFormat and then processing each split by getting the !RecordReader and processing data in the split before moving to the next. + + Changes to skew join sampling (PoissonSampleLoader) + See discussion in [[https://issues.apache.org/jira/browse/PIG-1062|PIG-1062]]. + + '''Problem 1''': + The earlier version of !PoissonSampleLoader stored the size on disk as an extra last column in the sampled tuples it returned in the map phase of the sampling MR job. This was used in the !PartitionSkewedKeys udf in the reduce stage of the sampling job to compute the total number of tuples using input-file-size/avg-disk-sz-from-samples. Avg-disk-sz-from-samples is not available with the new loader design, because getPosition() is no longer there. + + '''Solution:''' + !PoissonSampleLoader returns a special tuple with the number of rows in that map, in addition to the sampled tuples. To create this special tuple, the max row length in the input sampled tuples is tracked, and a new tuple with size max_row_length + 2 is created.
+ And spl_tuple[max_row_length] = "marker_string" + spl_tuple[max_row_length + 1] = num_rows + The size of max_row_length+2 is used because the join key can be an expression, which is evaluated on the columns in tuples returned by the sampler, and the expression might expect specific data types to be present in certain (<= max_row_length) locations of the tuple. + If the number of tuples in the sample is 0, the special tuple is not sent. + + In the !PartitionSkewedKeys udf in the reduce stage, the udf iterates over the tuples to find these special tuples and calculates the total number of rows. + + + '''Problem 2''': + !PoissonSampleLoader samples 17 tuples from every set of tuples that will fit into reducer memory (see PigSkewedJoinSpec). Let us call this number of tuples that fit into reducer memory X, i.e. we need to sample one tuple every X/17 tuples. + Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..). To get the number of samples to be taken in a map, the formula used was number-of-reducer-memories-needed * 17 / number-of-splits + Where: + number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size + disk_to_mem_factor has a default of 2. + + Then !PoissonSampleLoader would return sampled tuples by skipping split-size/num_samples bytes at a time. + + With the new loader we have to skip some number of tuples instead of bytes. But we don't have an estimate of the total number of tuples in the input. + One way to work around this would be to use the size of a tuple in memory to estimate its size on disk using the above disk_to_mem_factor; the number of tuples to be skipped would then be (split-size/avg_mem_size_of_tuple)/numSamples + + But the use of disk_to_mem_factor is very dubious: the real disk_to_mem_factor will vary based on the compression algorithm, data characteristics (sorting etc.), and encoding.
+ + '''Solution''': + The goal is to sample one tuple every X/17 tuples (X = number of tuples that fit in available reducer memory). + To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size. + The number of tuples skipped for every sampled tuple = 1/17 * (available_reducer_heap_size/average-tuple-mem-size) + + The average-tuple-mem-size and number-of-tuples-to-be-skipped-per-sampled-tuple are recalculated after a new tuple is sampled. + + Changes to order-by sampling (RandomSampler) + + '''Problem''': With the new interface, we cannot use the old approach of dividing the file size by the number of samples required and skipping that many bytes to get each new sample. + + '''Proposal''': + In getNext(), allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, generate a random number r s.t. 0<=r
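The skip-interval arithmetic in the solution above can be sketched as follows. This is a standalone illustration, not the actual !PoissonSampleLoader code: only the 17-samples-per-reducer-memory constant comes from PigSkewedJoinSpec, and the class, method names, and driver numbers are made up.

```java
// Recalculate, after each sampled tuple, how many tuples to skip before
// taking the next sample: skip = (heap / avg-tuple-mem-size) / 17.
public class SkipIntervalSketch {
    static final int SAMPLES_PER_REDUCER_MEMORY = 17;

    private long totalMemSize = 0; // in-memory size of all sampled tuples
    private long numSampled = 0;

    // Record one sampled tuple and return how many tuples to skip
    // before the next sample is taken.
    public long recordSample(long tupleMemSize, long reducerHeapBytes) {
        totalMemSize += tupleMemSize;
        numSampled++;
        double avgTupleMemSize = (double) totalMemSize / numSampled;
        // X = tuples that fit in reducer memory; sample one every X/17.
        double x = reducerHeapBytes / avgTupleMemSize;
        return (long) (x / SAMPLES_PER_REDUCER_MEMORY);
    }

    public static void main(String[] args) {
        SkipIntervalSketch s = new SkipIntervalSketch();
        System.out.println(s.recordSample(10, 1700)); // avg 10 -> X=170 -> skip 10
        System.out.println(s.recordSample(30, 1700)); // avg 20 -> X=85  -> skip 5
    }
}
```

Because the running average is updated on every sample, the skip interval adapts as bigger or smaller tuples are seen, with no disk_to_mem_factor involved.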
[Pig Wiki] Update of "PigAccumulatorSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAccumulatorSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigAccumulatorSpec -- New page: = Accumulator UDF = == Introduction == For data processing with PIG, it is very common to call "group by" or "cogroup" to group input tuples by a key, then call one or more UDFs to process each group. For example: {{{ A = load 'mydata'; B = group A by $0; C = foreach B generate group, myUDF1(A), myUDF2(A, 'some_param'), myUDF3(A); store C into 'myresult'; }}} In the current implementation, during the grouping process all tuples that belong to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problems. For a large key, if its tuples cannot fit into memory, performance suffers because extra data has to be spilled to disk. Since many UDFs do not really need to see all the tuples that belong to a key at the same time, it is possible to pass those tuples in batches. Good examples are COUNT() and SUM(). Tuples can be passed to UDFs in an accumulative manner. When all the tuples have been passed, the final method is called to retrieve the value. This way, we can minimize memory usage and improve performance by avoiding data spills. == UDF change == An Accumulator interface is defined. UDFs that are able to process tuples in an accumulative manner should implement this interface. It is defined as follows: {{{ public interface Accumulator { /** * Pass tuples to the UDF. You can retrieve the DataBag by calling b.get(index). * Each DataBag may contain 0 to many tuples for the current key */ public void accumulate(Tuple b) throws IOException; /** * Called when all tuples from the current key have been passed to accumulate. * @return the value for the UDF for this key. */ public T getValue(); /** * Called after getValue() to prepare processing for the next key.
*/ public void cleanup(); } }}} A UDF should still extend EvalFunc as before. The PIG engine detects based on context whether tuples can be processed accumulatively. If not, the regular EvalFunc is called. Therefore, for a UDF, both interfaces should be implemented properly. == Use Cases == The PIG engine processes tuples accumulatively only when all of the UDFs implement the Accumulator interface. If one of the UDFs is not an Accumulator, then all UDFs are called through their EvalFunc interface as regular UDFs. Following are examples where the accumulator interface of UDFs would be called: * group by {{{ A = load 'mydata'; B = group A by $0; C = foreach B generate group, myUDF(A); store C into 'myresult'; }}} * cogroup {{{ A = load 'mydata1'; B = load 'mydata2'; C = cogroup A by $0, B by $0; D = foreach C generate group, myUDF(A), myUDF(B); store D into 'myresult'; }}} * group by with sort {{{ A = load 'mydata'; B = group A by $0; C = foreach B { D = order A by $1; generate group, myUDF(D); } store C into 'myresult'; }}} * group by with distinct {{{ A = load 'mydata'; B = group A by $0; C = foreach B { D = A.$1; E = distinct D; generate group, myUDF(E); } store C into 'myresult'; }}} == When to Call Accumulator == The MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if a POSort or PODistinct in the inner plan of a foreach can be removed/replaced by using the secondary sort key supported by hadoop. If it is a POSort, it is removed. If it is a PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases, with order by and distinct inside the foreach inner plan, can still run in accumulative mode. The AccumulatorOptimizer checks the reducer plan and enables the accumulator if the following criteria are met: * The reducer plan uses POPackage as root, not any of its sub-classes.
POPackage is not for distinct, and none of its inputs is set as inner. * The successor of POPackage is a POForeach. * Each leaf of a POForEach input plan is an ExpressionOperator, and it must be one of the following: * ConstantExpression * POProject, whose result type is not BAG, or TUPLE and overloaded * POMapLookup * POCase * UnaryExpressionOperator * BinaryExpressionOperator * POBinCond * POUserFunc that implements the Accumulator interface and whose inputs contain only ExpressionOperation, POForEach, or POSortedDistinct, but not another POUserFunc. Therefore, if under POForEach, there ar
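For illustration, here is a minimal COUNT-style implementation of the Accumulator interface defined in the spec above. To keep the sketch self-contained the interface is re-declared locally and a plain List<Object> stands in for Pig's Tuple/DataBag types, so this is not a real Pig UDF (a real one would also extend EvalFunc, as the spec requires).

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Local re-declaration of the spec's interface; List<Object> stands in
// for Pig's Tuple in this standalone sketch.
interface Accumulator<T> {
    // Pass one batch of tuples for the current key.
    void accumulate(List<Object> batch) throws IOException;
    // Called once all tuples for the current key have been passed in.
    T getValue();
    // Called after getValue() to reset state for the next key.
    void cleanup();
}

// COUNT in accumulative form: only a running counter is kept, never
// the whole bag, so nothing has to spill to disk.
public class CountAccumulator implements Accumulator<Long> {
    private long count = 0;

    @Override public void accumulate(List<Object> batch) {
        count += batch.size();
    }
    @Override public Long getValue() { return count; }
    @Override public void cleanup() { count = 0; }

    public static void main(String[] args) {
        CountAccumulator c = new CountAccumulator();
        c.accumulate(Arrays.asList((Object) "a", "b")); // first batch
        c.accumulate(Arrays.asList((Object) "c"));      // second batch
        System.out.println(c.getValue()); // 3
        c.cleanup();                      // ready for the next key
    }
}
```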
[Pig Wiki] Update of "PigAccumulatorSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAccumulatorSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigAccumulatorSpec?action=diff&rev1=1&rev2=2 -- = Accumulator UDF = - == Introduction == For data processing with PIG, it is very common to call "group by" or "cogroup" to group input tuples by a key, then call one or more UDFs to process each group. For example: @@ -11, +10 @@ C = foreach B generate group, myUDF1(A), myUDF2(A, 'some_param'), myUDF3(A); store C into 'myresult'; }}} - - The current implementation is during grouping process, all tuples that belongs to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problem. For a large key, if its tuples can not fit into memory, performance has to sacrifice to spill extra data into disk. + The current implementation is during grouping process, all tuples that belongs to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problem. For a large key, if its tuples can not fit into memory, performance has to sacrifice to spill extra data into disk. Since many UDFs do not really need to see all the tuples that belongs to a key at the same time, it is possible to pass those tuples as batches. A good example would be like COUNT(), SUM(). Tuples can be passed to UDFs in accumulative manner. When all the tuples are passed, the final method is called to retrieve the value. This way, we can minimize the memory usage and improve performance by avoiding data spill. @@ -22, +20 @@ {{{ public interface Accumulator { /** - * Pass tuples to the UDF. You can retrive DataBag by calling b.get(index). + * Pass tuples to the UDF. You can retrive DataBag by calling b.get(index). 
* Each DataBag may contain 0 to many tuples for current key */ public void accumulate(Tuple b) throws IOException; @@ -32, +30 @@ * @return the value for the UDF for this key. */ public T getValue(); - + - /** + /** - * Called after getValue() to prepare processing for next key. + * Called after getValue() to prepare processing for next key. */ public void cleanup(); } }}} - UDF should still extend EvalFunc as before. The PIG engine would detect based on context whether tuples can be processed accumulatively. If not, then regular EvalFunc would be called. Therefore, for a UDF, both interfaces should be implemented properly == Use Cases == PIG engine would process tuples accumulatively only when all of the UDFs implements Accumulator interface. If one of the UDF is not Accumulator, then all UDFs are called by their EvalFunc interface as regular UDFs. Following are examples accumulator interface of UDFs would be called: -* group by + * group by - {{{ + . {{{ A = load 'mydata'; B = group A by $0; C = foreach B generate group, myUDF(A); store C into 'myresult'; - }}} + }}} -* cogroup + * cogroup - {{{ + . {{{ A = load 'mydata1'; B = load 'mydata2'; C = cogroup A by $0, B by $0; D = foreach C generate group, myUDF(A), myUDF(B); store D into 'myresult'; - }}} + }}} -* group by with sort + * group by with sort - {{{ + . {{{ A = load 'mydata'; B = group A by $0; C = foreach B { @@ -72, +69 @@ generate group, myUDF(D); } store C into 'myresult'; - }}} + }}} -* group by with distinct + * group by with distinct - {{{ + . {{{ A = load 'mydata'; B = group A by $0; C = foreach B { @@ -84, +81 @@ generate group, myUDF(E); } store C into 'myresult'; - }}} + }}} == When to Call Accumulator == - MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. 
This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. + . MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key
New attachment added to page PigAccumulatorSpec/homes/yinghe/Desktop on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigAccumulatorSpec/homes/yinghe/Desktop" for change notification. An attachment has been added to that page by yinghe. Following detailed information is available: Attachment name: SequenceDiagram.jpg Attachment size: 51846 Attachment link: http://wiki.apache.org/pig/PigAccumulatorSpec/homes/yinghe/Desktop?action=AttachFile&do=get&target=SequenceDiagram.jpg Page link: http://wiki.apache.org/pig/PigAccumulatorSpec/homes/yinghe/Desktop
[Pig Wiki] Update of "LoadStoreRedesignProposal" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by ThejasNair. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=32&rev2=33 -- '''Problem 2''': !PoissonSampleLoader samples 17 tuples from every set of tuples that will fit into reducer memory (see PigSkewedJoinSpec) . Let us call this number of tuples that fit into reducer memory - X. Ie we need to sample one tuple every X/17 tuples. - Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..) . To get the number of samples to be sampled in a map, the formula used was = number-of-reducer-memories-needed * 17 / number-of-splits + Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..) . To get the number of samples to be sampled in a map, the formula used was = number-of-reducer-memories-needed * 17 / number-of-splits <> Where - - number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size + number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size<> disk_to_mem_factor has default of 2. Then !PoissonSampleLoader would return sampled tuples by skipping split-size/num_samples bytes at a time. - With new loader we have to skip some number of tuples instead of bytes. But we don't have an estimate of total number of tuples in the input. + With new loader we have to skip some number of tuples instead of bytes. 
But we don't have an estimate of total number of tuples in the input.<> One way to work around this would be to use size of tuple in memory to estimate size of tuple in disk using above disk_to_mem_factor, then number of tuples to be skipped will be = (split-size/avg_mem_size_of_tuple)/numSamples But the use of disk_to_mem_factor is very dubious, the real disk_to_mem_factor will vary based on compression-algorithm, data characteristics (sorting etc), and encoding. '''Solution''': - The goal is to sample one tuple every X/17 tuples. (X = number of tuples that fit in available reducer memory) + The goal is to sample one tuple every X/17 tuples. (X = number of tuples that fit in available reducer memory).<> - To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size + To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size.<> Number of tuples skipped for every sampled tuple = 1/17 * ( available_reducer_heap_size/average-tuple-mem-size) The average-tuple-mem-size and number-of-tuples-to-be-skippled-every-sampled-tuple is recalculated after a new tuple is sampled.
[Pig Wiki] Update of "PigAccumulatorSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAccumulatorSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigAccumulatorSpec?action=diff&rev1=2&rev2=3 -- }}} == When to Call Accumulator == - . MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. + . MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. The AccumulatorOptimizer checks the reducer plan and enables accumulator if following criteria are met: - The AccumulatorOptimizer checks the reducer plan and enables accumulator if following criteria are met: * The reducer plan uses POPackage as root, not any of its sub-classes. POPackage is not for distinct, and any of its input is not set as inner. * The successor of POPackage is a POForeach. 
* The leaves of each POForEach input plan is an ExpressionOperator and it must be one of the following: @@ -109, +108 @@ {{attachment:/homes/yinghe/Desktop/SequenceDiagram.jpg}} + == Internal Changes == + === Accumulator === + . A new interface that a UDF can implement if it can run in accumulative mode. + + === PhysicalOperator === + . Add new methods setAccumulative(), setAccumStart(), setAccumEnd() to flag a physical operator to run in accumulative mode, and to mark the start and end of accumulation. This change is in the patch for PIG-1038. + + === MapReduceLauncher === + . Create AccumulatorOptimizer and use it to visit the plan. + + === AccumulatorOptimizer === + . Another MROpPlanVisitor. It checks the reduce plan and, if it meets all the criteria, sets the "accumulative" flag on POPackage and POForEach. It is created and invoked by MapReduceLauncher. + + === POStatus === + . Add a new state "STATUS_BATCH_OK" to indicate a batch is processed successfully in accumulative mode. + + === POForEach === + . If its "accumulative" flag is set, the bags passed to it through a tuple are AccumulativeBags as opposed to regular tuple bags. It gets the AccumulativeTupleBuffer from the bag. Then it runs a while loop, calling nextBatch() of AccumulativeTupleBuffer and passing the input to the inner plans. If an inner plan contains any UDF, the inner plan returns POStatus.STATUS_BATCH_OK if the current batch is processed successfully. When there are no more batches to process, POForEach notifies each inner plan that accumulation is done, makes a final call to get the result, and exits the while loop. At the end, POForEach returns the result to its successor in the reducer plan. The operators that call POForEach don't need to know whether POForEach gets its result through regular mode or accumulative mode. + + === AccumulativeBag === + . An implementation of DataBag used by POPackage for processing data in accumulative mode. This bag doesn't contain all tuples from the iterator.
Instead, it wraps an AccumulativeTupleBuffer, which contains an iterator to pull tuples out in batches. Calling iterator() on this class only gives you the tuples for the current batch. + + === AccumulativeTupleBuffer === + . An underlying buffer that is shared by all AccumulativeBags (one bag for group by, multiple bags for cogroup) generated by POPackage. POPackage has an inner class which implements this interface. POPackage creates an instance of this buffer and sets it into the AccumulativeBags. This buffer has methods to retrieve the next batch of tuples, which in turn call methods of POPackage to read tuples out of the iterator and put them in an internal list. The AccumulativeBag has access to that list to return an iterator of tuples. + + === POPackage === + . If its "accumulative" flag is set, it creates an AccumulativeBag and AccumulativeTupleBuffer as opposed to creating default tuple bags. It then sets the AccumulativeTupleBuffer into the AccumulativeBag, and sets the AccumulativeBag into the result tuple. + POPack
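The batch-driven contract described above can be sketched in plain Java. This is a hand-rolled illustration, not Pig's actual code: the interface below stands in for Pig's Accumulator (which operates on Tuples), and lists of longs stand in for bags of tuples so the sketch stays self-contained.

```java
import java.util.Arrays;
import java.util.List;

// Hand-rolled sketch of the accumulative-mode contract described above.
public class AccumulatorSketch {

    // Stand-in for Pig's Accumulator interface.
    interface Accumulator<T> {
        void accumulate(List<Long> batch); // called once per batch
        T getValue();                      // final call after the last batch
        void cleanup();                    // reset state for the next key
    }

    // A SUM-style UDF that never needs the whole bag in memory at once.
    static class Sum implements Accumulator<Long> {
        private long total = 0;
        public void accumulate(List<Long> batch) {
            for (long v : batch) total += v;
        }
        public Long getValue() { return total; }
        public void cleanup() { total = 0; }
    }

    // Mimics POForEach's while loop: accumulate each batch pulled from the
    // buffer, then make the final call to fetch the result.
    public static long process(List<List<Long>> batches) {
        Sum udf = new Sum();
        for (List<Long> batch : batches) {
            udf.accumulate(batch); // a batch processed OK (STATUS_BATCH_OK)
        }
        long result = udf.getValue(); // accumulation done: final call
        udf.cleanup();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList(
            Arrays.asList(1L, 2L, 3L),
            Arrays.asList(4L, 5L)))); // prints 15
    }
}
```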
[Pig Wiki] Trivial Update of "PigStreamingFunctionalSpec" by MarcioSilva
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigStreamingFunctionalSpec" page has been changed by MarcioSilva. The comment on this change is: correcting what appears to be a typo.. http://wiki.apache.org/pig/PigStreamingFunctionalSpec?action=diff&rev1=47&rev2=48 -- Streaming can have three separate meanings in the context of the Pig project: 1. A specific way of submitting jobs to Hadoop: Hadoop Streaming - 2. A form of processing in which the entire portion of the dataset that corresponds to a task in sent to the task and output streams out. There is no temporal or causal correspondence between an input record and specific output records. + 2. A form of processing in which the entire portion of the dataset that corresponds to a task is sent to the task and output streams out. There is no temporal or causal correspondence between an input record and specific output records. 3. The use of non-Java functions with Pig. The goal of Pig with respect to streaming is to support #2 for (a) Java UDFs, (b) non-Java UDFs, and (c) user-specified binaries/scripts. We will start with (c) since it would be most beneficial for the users. It is not our goal to be feature-by-feature compatible with Hadoop streaming as it is too open-ended and might force us to implement features that we don't necessarily want in Pig.
[Pig Wiki] Update of "PigSkewedJoinSpec" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by ThejasNair. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=12&rev2=13 -- In order to use skewed join, -* Skewed join currently works with tow-table inner join. +* Skewed join currently works with two-table inner join. * Append 'using "skewed"' construct to the join to force pig to use skewed join * pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance. The default value is =0.5=.
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=35&rev2=36 -- '''Proposal''': The goal is to sample tuples with equal probability for any tuple getting sampled (assuming the number of tuples to be sampled is much smaller than the total number of tuples). If N is the number of samples required, in getNext() allocate a buffer for N elements, populate it with the first N tuples, and continue scanning the partition. For every ith next() call, generate a random number r s.t. 0<=rhttp://hadoop.apache.org/common/docs/r0.20.1/streaming.html .) + + I propose that Pig move to a model of using Hadoop's default streaming format, which is to expect new-line separated records, with tab being used as a field separator. Hadoop allows users to + redefine the field separator, and so should Pig. This will also match the current default of using !PigStorage as the (de)serializer for streaming. As before, Pig should support communicating + with the executable via either stdin and stdout or files. This will force a syntax change in Pig Latin. Currently, if a user wants to stream data to an executable with comma-separated fields + instead of tab-separated fields, the syntax is: + + {{{ + define CMD `perl PigStreaming.pl - foo nameMap` input(stdin using PigStorage(',')) output(stdout using PigStorage(',')); + A = load 'file'; + B = stream A through CMD; + }}} + + The syntax should change to remove the references to store and load functions, as they are no longer meaningful.
Thus the above would become: + + {{{ + define CMD `perl PigStreaming.pl - foo nameMap` input(stdin using ',') output(stdout using ','); + A = load 'file'; + B = stream A through CMD; + }}} + + From an implementation viewpoint, the functionality required to write to and read from the streaming binary will be equivalent to the tuple parsing and serialization of !PigStorage.getNext() and + !PigStorage.putNext(). While it will not be possible to use PigStorage directly, every effort should be made to share this code (most likely by putting the actual code in static + utility methods that can be called by each class) to avoid double code maintenance costs. === Remaining Tasks === @@ -665, +696 @@ * Changes to order-by sampling (!RandomSampler) * Changes to skew join sampling (!PoissonSampleLoader) + Nov 23 2009, Gates + * Added section "Changes to Streaming" +
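A minimal sketch of the record handling this default implies: newline-separated records with a configurable field delimiter (tab by default), roughly what !PigStorage's parsing and serialization do. Class and method names here are illustrative, not Pig's.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch (not Pig's classes): newline-separated records with
// a configurable field delimiter, tab by default, as in Hadoop streaming.
public class StreamFormat {
    private final String fieldDelim;

    public StreamFormat(String fieldDelim) { this.fieldDelim = fieldDelim; }
    public StreamFormat() { this("\t"); } // Hadoop streaming's default

    // One record for the executable's stdin: fields joined by the
    // delimiter, terminated by a newline.
    public String serialize(List<String> tuple) {
        return String.join(fieldDelim, tuple) + "\n";
    }

    // One line of the executable's stdout parsed back into fields; the
    // -1 limit keeps trailing empty fields.
    public List<String> deserialize(String line) {
        return Arrays.asList(line.split(Pattern.quote(fieldDelim), -1));
    }

    public static void main(String[] args) {
        StreamFormat comma = new StreamFormat(",");
        System.out.print(comma.serialize(Arrays.asList("1", "alice"))); // prints 1,alice
    }
}
```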
[Pig Wiki] Trivial Update of "LoadStoreRedesignProposal" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=36&rev2=37 -- /** * Set statistics about the data being written. + * @throws IOException */ - void setStatistics(ResourceStatistics stats); + void setStatistics(ResourceStatistics stats, String location, Configuration conf) throws IOException; + + /** + * Set schema of the data being written + * @throws IOException + */ + void setSchema(ResourceSchema schema, String location, Configuration conf) throws IOException; } @@ -699, +706 @@ Nov 23 2009, Gates * Added section "Changes to Streaming" + Nov 23 2009, Dmitriy Ryaboy + * updated StoreMetadata to match changes made to LoadMetadata +
[Pig Wiki] Update of "GroupFunction" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "GroupFunction" page has been changed by AlanGates. http://wiki.apache.org/pig/GroupFunction?action=diff&rev1=2&rev2=3 -- <> + + '''AS OF PIG 0.2 GROUP FUNCTIONS HAVE BEEN REMOVED FROM THE LANGUAGE. THE FOLLOWING APPLIES ONLY TO PIG 0.1.''' + == Group Functions ==
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=38&rev2=39 -- !PigStorage.putNext(). While it will not be possible to use PigStorage directly, every effort should be made to share this code (most likely by putting the actual code in static utility methods that can be called by each class) to avoid double code maintenance costs. + It has been suggested that we should switch to the typed bytes protocol that is available in Hadoop and Hive (see + https://issues.apache.org/jira/browse/PIG-966?focusedCommentId=12781695&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781695 ). While we cannot switch the default, we can make this streaming + connection an interface so that users can easily extend it in the future. The interface should be quite simple: + + {{{ + interface PigToStream { + + /** + * Given a tuple, produce an array of bytes to be passed to the streaming + * executable. + */ + public byte[] serialize(Tuple t) throws IOException; + + /** + * Set the record delimiter to use when communicating with the streaming + * executable. The default if this is not set is \n. + */ + public void setRecordDelimiter(byte delimiter); + } + + interface StreamToPig { + + /** + * Given a byte array from a streaming executable, produce a tuple. + */ + public Tuple deserialize(byte[] bytes) throws IOException; + + /** + * Set the record delimiter to use when reading from the streaming + * executable. The default if this is not set is \n. + */ + public void setRecordDelimiter(byte delimiter); + } + }}} + + The default implementation of this would be as suggested above.
The syntax for describing how data is (de)serialized would then stay as it currently is, except instead of giving a + !StoreFunc the user would specify a !PigToStream, and instead of specifying a !LoadFunc a !StreamToPig. + + Additionally, it has been noted that this change takes away the current optimization of Pig Latin scripts such as the following: + + {{{ + A = load 'myfile' split by 'file'; + B = stream A through 'mycmd'; + store B into 'outfile'; + }}} + + In this case Pig will optimize the query by removing the load function and replacing it with !BinaryStorage, a function which simply passes the data as is to the streaming + executable. It does no record or field parsing. Similarly, the store in the above script would be replaced with !BinaryStorage. + + We have two options to replace this. First, we could say that if a class implementing !PigToStream also implements !InputFormat, then Pig will drop the Load statement and use that + !InputFormat directly to load data and then pass the results to the stream. The same would be done with !StreamToPig, !OutputFormat and store. Second, we could create + !IdentityLoader and !IdentityStreamToPig functions. !IdentityLoader.getNext would return a tuple that just had one bytearray, which would be the entire record. This would then be a + trivial serialization via the default !PigToStream. Similarly !IdentityStreamToPig would take the bytes returned by the stream and put them in a tuple of a single bytearray. The + store function would then naturally translate this tuple into the underlying bytes. + Functionally these are basically equivalent, since Pig would need to write code similar to the !IdentityLoader etc. for the second case. So I believe the primary difference is in + how it is presented to the user, not the functionality or code written underneath. + + Both of these approaches suffer from the problem that they assume !TextInputFormat and !TextOutputFormat.
For any other IF/OF it will not be clear how to parse key, value + pairs out of the stream data. + + This optimization represents a fair amount of work. As the current optimization is not documented, it is not clear how many users are using it. Based on that I vote that we + do not implement this optimization until such time as we see a need for it. === Remaining Tasks === * !BinStorage needs to implement !LoadMetadata's getSchema() to replace current determineSchema() @@ -709, +771 @@ Nov 23 2009, Dmitriy Ryaboy * updated StoreMetadata to match changes made to LoadMetadata + Nov 25 2009, Gates + * Updated section on streaming to suggest creating an interface for streaming (de)serializers rather than having only one hardwired option. Also added some thoughts on possible replacements for the current !BinaryStorage/split by file optimization. +
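A sketch of what a default implementation of the proposed pair might look like. String[] stands in for Pig's Tuple so the example is self-contained; only the byte-level contract (tab-separated fields, settable record delimiter) is illustrated, and the class name is invented.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch of a default implementation of the proposed PigToStream/StreamToPig
// pair. String[] stands in for Pig's Tuple to keep the example self-contained.
public class DefaultStreamSerde {
    private byte recordDelim = (byte) '\n'; // default per the proposal

    public void setRecordDelimiter(byte delimiter) { recordDelim = delimiter; }

    // PigToStream.serialize: fields joined by tab, record delimiter appended.
    public byte[] serialize(String[] fields) throws IOException {
        byte[] body = String.join("\t", fields).getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 1];
        System.arraycopy(body, 0, out, 0, body.length);
        out[body.length] = recordDelim;
        return out;
    }

    // StreamToPig.deserialize: strip the trailing delimiter, split on tab.
    public String[] deserialize(byte[] bytes) throws IOException {
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == recordDelim) len--;
        return new String(bytes, 0, len, StandardCharsets.UTF_8).split("\t", -1);
    }

    public static void main(String[] args) throws IOException {
        DefaultStreamSerde serde = new DefaultStreamSerde();
        String[] back = serde.deserialize(serde.serialize(new String[]{"1", "alice"}));
        System.out.println(back.length); // prints 2
    }
}
```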
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=132&rev2=133 -- ||6015||During execution, encountered a Hadoop error.|| ||6016||Out of memory.|| ||6017||Execution failed, while processing '|| + ||6018||Error while reading input|| == Change Log ==
[Pig Wiki] Update of "PigSkewedJoinSpec" by SriranjanManjunath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by SriranjanManjunath. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=13&rev2=14 -- = Skewed Join = <> + == Introduction == - - Parallel joins are vulnerable to the presence of skew in the underlying data. If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains [[#References|(1)]]. In order to counteract this problem, skewed join computes a histogram of the key space and uses this data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting one of the inputs on the join predicate and streaming the other input. + Parallel joins are vulnerable to the presence of skew in the underlying data. If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains [[#References|(1)]]. In order to counteract this problem, skewed join computes a histogram of the key space and uses this data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting one of the inputs on the join predicate and streaming the other input. <> - <> + == Use cases == - Skewed join can be used when the underlying data is sufficiently skewed and the user needs finer control over the allocation of reducers to counteract the skew. It should also be used when the data associated with a given key is too large to fit in memory. {{{ @@ -17, +16 @@ C = JOIN big BY b1, massive BY m1 USING "skewed"; }}} - In order to use skewed join, -* Skewed join currently works with two-table inner join. 
-* Append 'using "skewed"' construct to the join to force pig to use skewed join + * Append 'using "skewed"' construct to the join to force pig to use skewed join -* pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance. The default value is =0.5=. + * pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance. The default value is =0.5=. - <> + == Requirements == - -* Support a 'skewed' condition for the join command - Modify Join operator to have a "skewed" option. + * Support a 'skewed' condition for the join command - Modify Join operator to have a "skewed" option. -* Handle considerably large skew in the input data efficiently + * Handle considerably large skew in the input data efficiently -* Join tables whose keys are too big to fit in memory + * Join tables whose keys are too big to fit in memory + <> + == Implementation == - Skewed join translates into two map/reduce jobs - Sample and Join. The first job samples the input records and computes a histogram of the underlying key space. 
The second map/reduce job partitions the input table and performs a join on the predicate. In order to join the two tables, one of the tables is partitioned and the other is streamed to the reducer. The map task of the join job uses the ~-pig.keydist-~ file to determine the number of reducers per key. It then sends the key to each of the reducers in a round-robin fashion. Skewed joins happen in the reduce phase of the join job. {{attachment:partition.jpg}} <> + === Sampler phase === - If the underlying data is sufficiently skewed, load imbalances will result in a few reducers getting a lot of keys. As a first task, the sampler creates a histogram of the key distribution and stores it in the ~-pig.keydist-~ file. In order to reduce spillage, the sampler conservatively estimates the number of rows that can be sent to a single reducer based on the memory available for the reducer. The memory available for the reducer is a product of the heap size and the memusage parameter specified by the user. Using
[Pig Wiki] Update of "PigSkewedJoinSpec" by SriranjanManjunath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=14&rev2=15 -- '''!NullablePartitionWritable''' - This is an adapter class which provides a partition index to the NullableWritable class. The partition index is used by both the partitioning and the streaming table. For non-skewed keys, this value is set to -1. + This is an adapter class which provides a partition index to the !NullableWritable class. The partition index is used by both the partitioning and the streaming table. For non-skewed keys, this value is set to -1. '''!PigMapReduce'''
[Pig Wiki] Update of "PigSkewedJoinSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=15&rev2=16 -- Number of Tuples from First Table (tupleCount) = (sampleCount / totalSampleCount) * (inputFileSize / avgDiskUsage) Number of Reducers = (int) Math.round(Math.ceil((double) tupleCount / tupleMCount)); }}} + + For example, if we assume + * total number of samples = 200 + * total number of samples with key k1 = 30 + * size of input file = 1G. + * totalMemory = 150M + * avgMemUsage for tuples of k1 = 150 bytes + * avgDiskUsage for tuples of k1 = 100 bytes + + then, + * estimated total number of k1 that can fit in memory = 150M/150 = 1M + * estimated total number of tuples from input file = 1G/100 = 10M tuples + * estimated number of tuples for k1 from input file = (30/200) * 10M = 1.5M + * estimated total number of reducers for k1 = Math.ceil (1.5M/1M) = 2 + + This calculation is done on every key of samples. If a key requires more than 1 reducer, it is regarded as a skewed key, and pre-allocated with multiple reducers. The reducers are allocated to skewed keys in round robin fashion. + This UDF generates an output which will be used by the following join job. The format of the output file is a map. It has two keys: * totalreducers: the number of total reducers for second job
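The arithmetic in the worked example above can be checked with a few lines of Java. Variable names follow the formulas in the text; this is just the calculation, not Pig's sampler code, and it assumes 1G = 2^30 bytes and 150M = 150 * 2^20 bytes as in the example.

```java
// The reducer-count estimate from the worked example, as plain arithmetic.
public class SkewEstimate {
    public static int reducersForKey(long sampleCount, long totalSampleCount,
                                     long inputFileSize, double avgDiskUsage,
                                     long totalMemory, double avgMemUsage) {
        // tuples of this key that fit in one reducer's memory (tupleMCount)
        double tupleMCount = totalMemory / avgMemUsage;
        // estimated tuples of this key in the whole input (tupleCount)
        double tupleCount = ((double) sampleCount / totalSampleCount)
                * (inputFileSize / avgDiskUsage);
        return (int) Math.ceil(tupleCount / tupleMCount);
    }

    public static void main(String[] args) {
        long G = 1024L * 1024 * 1024, M = 1024L * 1024;
        // 30 of 200 samples are k1, 1G input, 100 bytes/tuple on disk,
        // 150M of heap, 150 bytes/tuple in memory: 1.5M tuples vs 1M in
        // memory, so k1 is skewed and gets 2 reducers.
        System.out.println(reducersForKey(30, 200, G, 100.0, 150 * M, 150.0)); // prints 2
    }
}
```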
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=133&rev2=134 -- ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| ||1109||Input ( ) on which outer join is desired should have a valid schema|| + ||1110||"Unsupported query: You have an partition column () inside a in the filter condition.|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=134&rev2=135 -- ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| ||1109||Input ( ) on which outer join is desired should have a valid schema|| - ||1110||"Unsupported query: You have an partition column () inside a in the filter condition.|| + ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| + ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=135&rev2=136 -- ||2206||Error visiting POSort inner plan|| ||2207||POForEach inner plan has more than 1 root|| ||2208||Exception visiting foreach inner plan|| + ||2209||Internal error while processing any partition filter conditions in the filter after the load|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges -- New page: = Backward incompatible changes in Pig 0.7.0 = Pig 0.7.0 will include some major changes to Pig most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of this changes will not be backward compatible and will require users to change the pig scripts or their UDFs. This document is intended to keep track of this changes to that we can document them for the release. == Changes to the Load and Store functions == == Handling Compressed Data == == Local Mode == == Streaming == == Other Changes == - Split by file == Open Questions ==
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=136&rev2=137 -- ||1109||Input ( ) on which outer join is desired should have a valid schema|| ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| + ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=1&rev2=2 -- = Backward incompatible changes in Pig 0.7.0 = - Pig 0.7.0 will include some major changes to Pig most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of this changes will not be backward compatible and will require users to change the pig scripts or their UDFs. This document is intended to keep track of this changes to that we can document them for the release. + Pig 0.7.0 will include some major changes to Pig, most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of these changes will not be backward compatible and will require users to change their pig scripts or their UDFs. This document is intended to keep track of such changes so that we can document them for the release. - == Changes to the Load and Store functions == + == Changes to the Load and Store Functions == == Handling Compressed Data == + + In 0.6.0 or earlier versions Pig supported bzip compressed files with extensions of .bz or .bz2 as well as gzip compressed files with .gz extension. Pig was able to both read and write files in this format with the understanding that gzip compressed files could not be split across multiple maps while bzip compressed files could. Also, data compression was completely decoupled from the data format and Load/Store functions, meaning that any loader could read compressed data and any store function could write it just by virtue of having the right extension on the files it was reading or writing. + + With Pig 0.7.0 the read/write functionality is taken over by Hadoop's Input/OutputFormat, and how compression is handled, or whether it is handled at all, depends on the Input/OutputFormat used by the loader/store function. 
+ + The main input format that supports compression is TextInputFormat. It supports bzip files with .bz2 extension and gzip files with .gz extension. '''Note that it does not support .bz files'''. PigStorage is the only loader that comes with Pig that is derived from TextInputFormat, which means it will be able to handle .bz2 and .gz files. Other loaders such as BinStorage will no longer support compression. + + On the store side, TextOutputFormat also supports compression but the store function needs to do additional work to enable it. Again, PigStorage will support compression while other functions will not. + + If you have a custom load/store function that needs to support compression, you would need to make sure that the underlying Input/OutputFormat supports this type of compression. + == Local Mode == == Streaming == == Other Changes ==
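The support matrix described above boils down to a file-extension check. The helper below is hypothetical (it is not a Pig API); it merely encodes the stated rules for a TextInputFormat-based loader such as PigStorage.

```java
// Hypothetical helper encoding the 0.7.0 extension rules stated above
// for a TextInputFormat-based loader such as PigStorage.
public class CompressionSupport {
    public static boolean hasCompressedExtension(String path) {
        return path.endsWith(".gz") || path.endsWith(".bz2") || path.endsWith(".bz");
    }

    /** True if a TextInputFormat-based loader can read the file in 0.7.0. */
    public static boolean readableByTextInputFormat(String path) {
        if (path.endsWith(".bz")) return false;   // .bz is no longer supported
        return path.endsWith(".bz2")              // bzip2: supported, splittable
            || path.endsWith(".gz")               // gzip: supported, not splittable
            || !hasCompressedExtension(path);     // plain, uncompressed text
    }

    public static void main(String[] args) {
        System.out.println(readableByTextInputFormat("part-00000.bz2")); // prints true
        System.out.println(readableByTextInputFormat("part-00000.bz"));  // prints false
    }
}
```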
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=2&rev2=3 -- == Local Mode == == Streaming == - == Other Changes == + == Split by File == - - Split by file + In the earlier versions of Pig, a user could specify "split by file" on the loader statement which would make sure that each map got the entire file rather than having the files further divided into blocks. This feature was primarily designed for streaming optimization but could also be used with loaders that can't deal with incomplete records. We don't believe that this functionality has been widely used. + + Because the slicing of the data is no longer in Pig's control, we can't support this feature generically for every loader. If a particular loader needs this functionality, it will need to make sure that the underlying InputFormat supports it. + + We will have a different approach for streaming optimization if that functionality is necessary. == Open Questions ==
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=137&rev2=138 -- ||2207||POForEach inner plan has more than 1 root|| ||2208||Exception visiting foreach inner plan|| ||2209||Internal error while processing any partition filter conditions in the filter after the load|| + ||2210||Internal Error in logical optimizer.|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=138&rev2=139 -- ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| + ||1113||Please provide uri to the metadata server using -Dmetadata.uri system property|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=139&rev2=140 -- ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| - ||1113||Please provide uri to the metadata server using -Dmetadata.uri system property|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=3&rev2=4 -- Pig 0.7.0 will include some major changes to Pig, most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of these changes will not be backward compatible and will require users to change their pig scripts or their UDFs. This document is intended to keep track of such changes so that we can document them for the release. == Changes to the Load and Store Functions == + + TBW + + == Handling Compressed Data == In 0.6.0 or earlier versions Pig supported bzip compressed files with extensions of .bz or .bz2 as well as gzip compressed files with .gz extension. Pig was able to both read and write files in this format with the understanding that gzip compressed files could not be split across multiple maps while bzip compressed files could. Also, data compression was completely decoupled from the data format and Load/Store functions, meaning that any loader could read compressed data and any store function could write it just by virtue of having the right extension on the files it was reading or writing. @@ -19, +22 @@ == Local Mode == == Streaming == + + There are two things that are changing in streaming. + + First, in the initial (0.7.0) release, '''we will not support the optimization''' where, if streaming follows a load of a compatible format or is followed by a format-compatible store, the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlikely to be used. 
+ + Second, '''you can no longer use load/store functions for (de)serialization.''' + == Split by File == In the earlier versions of Pig, a user could specify "split by file" on the loader statement which would make sure that each map got the entire file rather than having the files further divided into blocks. This feature was primarily designed for streaming optimization but could also be used with loaders that can't deal with incomplete records. We don't believe that this functionality has been widely used.
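The extension-based compression behavior described above (0.6.0 and earlier) can be sketched in Pig Latin; the paths here are illustrative, not from the original page:

{{{
-- 0.6.0 and earlier: compression chosen purely by file extension,
-- independent of the load/store function in use
A = load 'input/part-00000.bz2' using PigStorage('\t');
-- any store function could emit gzip output simply by targeting a .gz path
store A into 'output.gz' using PigStorage('\t');
}}}

Note the asymmetry the page describes: the bzip input above could be split across maps, while gzip files could not.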
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=4&rev2=5 -- == Changes to the Load and Store Functions == - TBW + TBW [Need to take a load (with and withoutcustom slicer) and a store function and create new versions as examples. Can use PigStorage for (1) and (3) but need some loader for (2).] + == Handling Compressed Data == @@ -21, +22 @@ If you have a custom load/store function that needs to support compression, you would need to make sure that the underlying Input/OutputFormat supports this type of compression. == Local Mode == + + The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent for most applications. Possible differnces you will see: + + 1. Hadoop local mode is about order of magnitude slower than Pig's local mode. Something that Hadoop team promised to address. + 2. For algebraic functions, no the entire Algebraic interface will be used which is likely a good think if you are using local mode for testing your production applications. + == Streaming == There are two things that are changing in streaming. First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlekly to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' + Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. 
The defaul (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. == Split by File ==
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=5&rev2=6 -- == Changes to the Load and Store Functions == - TBW [Need to take a load (with and withoutcustom slicer) and a store function and create new versions as examples. Can use PigStorage for (1) and (3) but need some loader for (2).] + TBW [Need to take a load (with and without custom slicer) and a store function and create new versions as examples. Can use PigStorage for (1) and (3) but need to choose a loader for (2).] == Handling Compressed Data == @@ -32, +32 @@ There are two things that are changing in streaming. - First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlekly to be used. + First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlikely to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The defaul (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. 
+ Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. == Split by File ==
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=6&rev2=7 -- == Local Mode == - The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent for most applications. Possible differences you will see: + The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent for most applications. Possible differences you will see are: 1. Hadoop local mode is about an order of magnitude slower than Pig's local mode. Something the Hadoop team promised to address. - 2. For algebraic functions, no the entire Algebraic interface will be used which is likely a good think if you are using local mode for testing your production applications. + 2. For algebraic functions, now the entire Algebraic interface will be used, which is likely a good thing if you are using local mode for testing your production applications. == Streaming == There are two things that are changing in streaming. - First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store.
The main reason we are not porting the optimization is that the work is not trivial and the optimization was never documented and so unlikely to be used. Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. @@ -46, +46 @@ == Open Questions == + Q: Should String->Text conversion be part of this release. + A: Pros: 20-30% improved memory utilization; cons: more compatibility is broken. +
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=7&rev2=8 -- We will have a different approach for streaming optimization if that functionality is necessary. + == Access to Local Files from Map-Reduce Mode + + In the earlier version of Pig, you could access a local file from map-reduce mode by prepending file:// to the file location: + + {{{ + A = load 'file:/mydir/myfile'; + ... + }}} + + When Pig processed this statement, it would first copy the data to DFS and then import it into the execution pipeline. + + In Pig 0.7.0, you can no longer do this and if this functionality is still desired, you can add the copy into your script manually: + + {{{ + fs copyFromLocal src dist + A = load 'dist'; + + }}} + == Open Questions == Q: Should String->Text conversion be part of this release.
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=8&rev2=9 -- First, in the initial (0.7.0) release, '''we will not support optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and the optimization was never documented and so unlikely to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. + Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This formar is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can be also used directly in the streaming statement. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. + + We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within straming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats. == Split by File == @@ -44, +46 @@ We will have a different approach for streaming optimization if that functionality is necessary. 
- == Access to Local Files from Map-Reduce Mode + == Access to Local Files from Map-Reduce Mode == In the earlier version of Pig, you could access a local file from map-reduce mode by prepending file:// to the file location:
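The streaming change above means the (de)serializer is now named explicitly in the script rather than being a load/store function. A rough Pig Latin sketch of what that might look like (the command name and delimiter are hypothetical; the authoritative syntax is in the LoadStoreRedesignProposal page cited above):

{{{
-- hypothetical example: naming the new PigStreaming (de)serializer directly
DEFINE mycmd `perl mycmd.pl`
    input(stdin using PigStreaming(','))
    output(stdout using PigStreaming(','));
A = load 'data' as (f1:chararray, f2:int);
B = stream A through mycmd;
}}}

Omitting the input/output clauses would fall back to the default PigStreaming behavior, which per the page continues to match the old PigStorage-style format.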
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Pra deepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=9&rev2=10 -- Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This formar is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can be also used directly in the streaming statement. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. - We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within straming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats. + We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within streaming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats. == Split by File == @@ -60, +60 @@ In Pig 0.7.0, you can no longer do this and if this functionality is still desired, you can add the copy into your script manually: {{{ - fs copyFromLocal src dist + fs -copyFromLocal src dist A = load 'dist'; }}}
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=10&rev2=11 -- }}} + == Removing Custom Comparators + + This functionality was added to deal with a gap in Pig's early functionality - the lack of numeric comparison in order by as well as the lack of descending sort. The replacement functionality has been present for the last 4 releases and custom comparators have been deprecated in the last several releases. Custom comparator support is removed in this release. + == Open Questions == Q: Should String->Text conversion be part of this release.
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=11&rev2=12 -- }}} - == Removing Custom Comparators + == Removing Custom Comparators == This functionality was added to deal with a gap in Pig's early functionality - the lack of numeric comparison in order by as well as the lack of descending sort. The replacement functionality has been present for the last 4 releases and custom comparators have been deprecated in the last several releases. Custom comparator support is removed in this release.
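With numeric comparison and descending sorts built into order by, the common uses of custom comparators can now be expressed directly in Pig Latin (relation and field names here are illustrative):

{{{
A = load 'data' as (name:chararray, score:int);
-- descending numeric sort, previously a motivation for a custom comparator
B = order A by score desc;
}}}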
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=140&rev2=141 -- ||2182||Prune column optimization: Only relational operator can be used in column prune optimization.|| ||2183||Prune column optimization: LOLoad must be the root logical operator.|| ||2184||Prune column optimization: Fields list inside RequiredFields is null.|| + ||2185||Prune column optimization: Unable to prune columns.|| ||2186||Prune column optimization: Cannot locate node from successor|| ||2187||Column pruner: Cannot get predessors|| ||2188||Column pruner: Cannot prune columns|| @@ -422, +423 @@ ||2208||Exception visiting foreach inner plan|| ||2209||Internal error while processing any partition filter conditions in the filter after the load|| ||2210||Internal Error in logical optimizer.|| + ||2211||Column pruner: Unable to prune columns.|| + ||2212||Unable to prune plan.|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigMix" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by AlanGates. http://wiki.apache.org/pig/PigMix?action=diff&rev1=13&rev2=14 -- || PigMix_12 || 55.33|| 95.33 || 0.58 || || Total || 1352.33 || 1357 || 1.00 || || Weighted avg || || || 1.04 || + + Run date: January 4, 2010, run against 0.6 branch as of that day + || Test || Pig run time || Java run time || Multiplier || + || PigMix_1 || 138.33 || 112.67|| 1.23 || + || PigMix_2 || 66.33|| 39.33 || 1.69 || + || PigMix_3 || 199 || 83.33 || 2.39 || + || PigMix_4 || 59 || 60.67 || 0.97 || + || PigMix_5 || 80.33|| 113.67|| 0.71 || + || PigMix_6 || 65 || 77.67 || 0.84 || + || PigMix_7 || 63.33|| 61|| 1.04 || + || PigMix_8 || 40 || 47.67 || 0.84 || + || PigMix_9 || 214 || 215.67|| 0.99 || + || PigMix_10 || 284.67 || 284.33|| 1.00 || + || PigMix_11 || 141.33 || 151.33|| 0.93 || + || PigMix_12 || 55.67|| 115 || 0.48 || + || Total || 1407 || 1362.33 || 1.03 || + || Weighted Avg || || || 1.09 || + == Features Tested ==
[Pig Wiki] Update of "PigLogicalPlanOptimizerRewrite" b y AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigLogicalPlanOptimizerRewrite" page has been changed by AlanGates. http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite -- New page: == Problem Statement == The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. === Issues that Need to be Addressed in this Rework === '''One:''' !OperatorPlan has far too many operations. It has 29 public methods. This needs to be pared down to a minimal set of operators that are well defined. '''Two:''' Currently, relational operators (Join, Sort, etc.) and expression operators (add, equals, etc.) are both !LogicalOperators. Operators such as Cogroup that contain expressions have !OperatorPlans that contain these expressions. This was done for two reasons: 1. To make it easier for visitors to visit both types of operators (that is, visitors didn't have to have separate logic to handle expressions). 1. To better handle the ambiguous nature of inner plans in Foreach. However, it has led to visitors and graphs that are hard to understand. Both of the above concerns can be handled while breaking this binding so that relational and expression operators are separate types.
'''Three:''' Related to the issue of relational and expression operators sharing a type is that inner plans have connections to outer plans. Take for example a script like {{{ A = load 'file1' as (x, y); B = load 'file2' as (u, v); C = cogroup A by x, B by u; D = filter C by A.x > 0; }}} In this case the cogroup will have two inner plans, one of which will be a project of A.x and the other a project of B.u. The !LOProject objects representing these projections will hold actual references to the !LOLoad operators for A and B. This makes disconnecting and rearranging nodes in the plan much more difficult. Consider if the optimizer wants to move the filter in D above C. Now it has to not only change connections in the outer plan between load, cogroup, and filter; it also has to change connections in the first inner plan of C, because this now needs to point to the !LOFilter for D rather than the !LOLoad for A. '''Four:''' The work done on Operator and !OperatorPlan to support the original rules for the optimizer had two main problems: 1. The set of primitives chosen were not the correct ones. 1. The operations chosen were put on the generic super classes (Operator) rather than further down on the specific classes that would know how to implement them. '''Five:''' At a number of points efforts were made to keep the logical plan close to the physical plan. For example, !LOProject represents all of the same operations that !POProject does. While this is convenient in translation, it is not convenient when trying to optimize the plan. The !LogicalPlan needs to focus on representing the logic of the script in a way that is easy for semantic checkers (such as !TypeChecker) and the optimizer to work with. '''Six:''' The rule of one operation per operator was violated. !LOProject handles three separate roles (converting from a relational to an expression operator, actually projecting, and converting from an expression to a relational operator).
This makes coding much more complex for the optimizer because when it encounters an !LOProject it must first determine which of these three roles it is playing before it can understand how to work with it. The following proposal will address all of these issues. == Proposed Methodology == Fixing these issues will require extensive changes, including a complete rewrite of Operator, !OperatorPlan, !PlanVisitor, !LogicalOperator, !LogicalPlan, !LogicalPlanVisitor, every current subclass of !LogicalOperator, and all existing optimizer rules. It will also require extensive changes, though not complete rewrites, in existing subclasses of !LogicalTransformer. To avoid destabilizing the entire codebase during this operation, this will be done in a new set of packages as a totally separate set of classes. Linkage code will be written to translate the current !LogicalPlan to the new experimental !LogicalPlan class. A new
New attachment added to page PigLogicalPlanOptimizerRewrite on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigLogicalPlanOptimizerRewrite" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: expressiontree.jpg Attachment size: 28430 Attachment link: http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite?action=AttachFile&do=get&target=expressiontree.jpg Page link: http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Pra deepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=12&rev2=13 -- First, in the initial (0.7.0) release, '''we will not support optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and the optimization was never documented and so unlikely to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This formar is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can be also used directly in the streaming statement. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. + Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that has to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This format is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can also be used directly in the streaming statement. Details of the new interface are described in http://wiki.apache.org/pig/LoadStoreRedesignProposal. We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within streaming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats.
[Pig Wiki] Update of "PigJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal -- New page: = Pig Journal = This document is a successor to the ProposedRoadMap. Rather than simply propose the work going forward for Pig, it also summarizes work done in the past (back to Pig moving from a research project at Yahoo Labs to being a part of the Yahoo grid team, which was approximately the time Pig was first released to open source), current work, and proposed future work. Note that proposed future work is exactly that, __proposed__. There is no guarantee that it will be done, and the project is still open to input on whether and when such work should be done. == Completed Work == The following table contains a list of features that have been completed, as of Pig 0.6 || Feature || Available in Release || Comments || || Describe Schema || 0.1 || || || Explain Plan || 0.1 || || || Add log4j to Pig Latin || 0.1 || || || Parameterized Queries|| 0.1 || || || Streaming|| 0.1 || || || Documentation|| 0.2 || Docs are never really done of course, but Pig now has a setup document, tutorial, Pig Latin users and reference guides, a cookbook, a UDF writers guide, and API javadocs. || || Early error detection and failure|| 0.2 || When this was originally added to the !ProposedRoadMap it referred to being able to do type checking and other basic semantic checks. 
|| || Remove automatic string encoding || 0.2 || || || Add ORDER BY DESC|| 0.2 || || || Add LIMIT|| 0.2 || || || Add support for NULL values || 0.2 || || || Types beyond String || 0.2 || || || Multiquery support || 0.3 || || || Add skewed join || 0.4 || || || Add merge join || 0.4 || || || Support Hadoop 0.20 || 0.5 || || || Improved Sampling|| 0.6 || There is still room for improvement for order by sampling || || Change bags to spill after reaching fixed size || 0.6 || Also created bag backed by Hadoop iterator for single UDF cases || || Add Accumulator interface for UDFs || 0.6 || || || Switch local mode to Hadoop local mode || 0.6 || || || Outer join for default, fragment-replicate, skewed || 0.6 || || || Make configuration available to UDFs || 0.6 || || == Work in Progress == This covers work that is currently being done. For each entry the main JIRA for the work is referenced. || Feature || JIRA || Comments || || Metadata || [[http://issues.apache.org/jira/browse/PIG-823|PIG-823]] || || || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || || Load Store Redesign || [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]] || || || Add SQL Support || [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]] || || || Change Pig internal representation of chararray to Text || [[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready, unclear when to commit to minimize disruption to users and destabilization to code base. || || Integration with Zebra || [[http://issues.apache.org/jira/browse/PIG-833|PIG-833]] || || == Proposed Future Work == Work that the Pig project proposes to do in the future is further broken into three categories: 1. Work that we are agreed needs to be done, and also the approach to the work is generally agreed upon, but we have not gotten to it yet 2. Work that we are agreed needs to be done, but the approach is not yet clear or there is not general agreement as to which approach is best 3.
Experimental, which includes features that
[Pig Wiki] Update of "PigTools" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTools" page has been changed by AlanGates. http://wiki.apache.org/pig/PigTools?action=diff&rev1=13&rev2=14 -- 'hamake' utility allows you to automate incremental processing of datasets stored on HDFS using Hadoop tasks written in Java or using PigLatin scripts. + === Piglet === + http://github.com/iconara/piglet + + Piglet is a DSL for writing Pig Latin scripts in Ruby. Piglet aims to look like Pig Latin while allowing for things like loops and control of flow that are missing from Pig. + + === PigPen === http://issues.apache.org/jira/browse/PIG-366