[Pig Wiki] Update of "FrontPage" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/FrontPage -- [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! - Pig Latin Editors, Pig Python wrappers, Pig available on Amazon, and other tools, see PigTools - == Developer Documentation == * How tos * HowToDocumentation
[Pig Wiki] Update of "FrontPage" by GregStein
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by GregStein: http://wiki.apache.org/pig/FrontPage The comment on the change is: Restore useful information. And it shouldn't be just a vendor link. -- http://hadoop.apache.org/pig/ + (./) Check it out ... updates and new additions. + + * New to Pig? Getting Started ... + 1. PigOverview - An overview of Pig's capabilities + 1. [http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html Pig Quick Start] - How to build and run Pig + 1. [http://hadoop.apache.org/pig/docs/r0.3.0/tutorial.html Pig Tutorial]- Tackle a real task with pig, start to finish - [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! + 1. [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! + + * Pig Language + + * [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Reference Manual] - Includes Pig Latin, built-in functions, and shell commands + + * Pig Functions + * PiggyBank - User-defined functions (UDFs) contributed by Pig users! + * [http://hadoop.apache.org/pig/docs/r0.3.0/udf.html UDF Manual] - Write your own UDFs + + * (./) Pig Latin Editors, Pig Python wrappers, and other tools, see PigTools + + * More Pig + * [http://hadoop.apache.org/pig/docs/r0.3.0/cookbook.html Apache Pig Cookbook] - Want Pig to fly? Tips and tricks on how to write efficient Pig scripts + * [http://hadoop.apache.org/pig/javadoc/docs/api/ Javadocs] - Refer to the Javadocs for embedded Pig and UDFs + * [http://wiki.apache.org/pig/FAQ FAQ] - The answer to your question may be here + == Developer Documentation == * How tos
[Pig Wiki] Update of "FrontPage" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/FrontPage The comment on the change is: Removing duplicate links to the documentation per discussion on the user list. -- == User Documentation == + * [http://hadoop.apache.org/pig/ User Documentation] - http://hadoop.apache.org/pig/ - - (./) Check it out ... updates and new additions. - - * New to Pig? Getting Started ... - 1. PigOverview - An overview of Pig's capabilities - 1. [http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html Pig Quick Start] - How to build and run Pig - 1. [http://hadoop.apache.org/pig/docs/r0.3.0/tutorial.html Pig Tutorial]- Tackle a real task with pig, start to finish - 1. [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! + * [http://www.cloudera.com/hadoop-training-pig-introduction Online Pig Training] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! - - * Pig Language - - * [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Reference Manual] - Includes Pig Latin, built-in functions, and shell commands - - * Pig Functions - * PiggyBank - User-defined functions (UDFs) contributed by Pig users! + * PiggyBank - User-defined functions (UDFs) contributed by Pig users! - * [http://hadoop.apache.org/pig/docs/r0.3.0/udf.html UDF Manual] - Write your own UDFs - - * (./) Pig Latin Editors, Pig Python wrappers, and other tools, see PigTools - - * More Pig - * [http://hadoop.apache.org/pig/docs/r0.3.0/cookbook.html Apache Pig Cookbook] - Want Pig to fly? 
Tips and tricks on how to write efficient Pig scripts - * [http://hadoop.apache.org/pig/javadoc/docs/api/ Javadocs] - Refer to the Javadocs for embedded Pig and UDFs - * [http://wiki.apache.org/pig/FAQ FAQ] - The answer to your question may be here - == Developer Documentation == * How tos
[Pig Wiki] Update of "FrontPage" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/FrontPage -- '''Interested in Pig Guts?''' We are completely redesigning the Pig execution and optimization framework. For design details see PigOptimizationWishList and PigExecutionModel. '''Want to contribute but don't know where to kick in?''' Here is a [http://wiki.apache.org/pig/ProposedProjects list of projects] we would like to see done. We need new blood! + + '''Pig available as part of Amazon's Elastic !MapReduce''', as of August 2009. == General Information ==
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification -- ||2174||Internal exception. Could not create the sampler job. || ||2175||Internal error. Could not retrieve file size for the sampler. || ||2176||Error processing right input during merge join|| + ||2177||Prune column optimization: Cannot retrieve operator from null or empty list|| + ||2178||Prune column optimization: The matching node from the optimizor framework is null|| + ||2179||Prune column optimization: Error while performing checks to prune columns.|| + ||2180||Prune column optimization: Only LOForEach and LOSplit are expected|| + ||2181||Prune column optimization: Unable to prune columns.|| + ||2182||Prune column optimization: Only relational operator can be used in column prune optimization.|| + ||2183||Prune column optimization: LOLoad must be the root logical operator.|| + ||2184||Prune column optimization: Fields list inside RequiredFields is null.|| + ||2185||Prune column optimization: Unable to prune columns when processing node|| + ||2186||Prune column optimization: Cannot locate node from successor|| + ||2187||Column pruner: Cannot get predessors|| + ||2188||Column pruner: Cannot prune columns|| + ||2189||Column pruner: Expect schema|| + ||2190||PruneColumns: Cannot find predecessors for logical operator|| + ||2191||PruneColumns: No input to prune|| + ||2192||PruneColumns: Column to prune does not exist|| + ||2193||PruneColumns: Foreach can only have 1 predecessor|| + ||2194||PruneColumns: Expect schema|| + ||2195||PruneColumns: Fail to visit foreach inner plan|| + ||2196||RelationalOperator: Exception when traversing inner plan|| + ||2197||RelationalOperator: Cannot drop column which require *|| + ||2198||LOLoad: load only take 1 input|| + ||2199||LOLoad: schema mismatch|| ||2998||Unexpected internal error.|| ||2999||Unhandled 
internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal New page: = Proposed Redesign For Load, Store, and Slicer in Pig = == Goals == The current design of !LoadFunc, !StoreFunc, and the Slicer interfaces in Pig are not adequate. This proposed redesign has the following goals: 1. The Slicer interface is redundant. Remove it and allow users to directly use Hadoop !InputFormats in Pig. 1. It is not currently easy to use a separate !OutputFormat for a !StoreFunc. This should be made easy to allow users to store data into locations other than HDFS. 1. Currently users that wish to operate on Pig and Map-Reduce are required to write Hadoop !InputFormat and !OutpuFormat as well as a Pig loader and Pig storage functions. While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost. 1. The major difference between a Hadoop !InputFormat and a Pig load function is the data model. Hadoop views data as key-value pairs, Pig as a tuple. Similarly for !OutputFormat and store functions. 1. New storage formats such as Zebra are being implemented for Hadoop that include metadata information such as schema, etc. The !LoadFunc interface needs to allow Pig to obtain this metadata. There is a describeSchema call in the current interface. More functions may be necessary. 1. These new storage formats also plan to support pushing of, at least, projection and selection into the storage layer. Pig needs to be able to query loaders to determine what if any pushdown capabilities they support and then make use of those capabilities. 1. There already exists one metadata system in Hadoop (Hive's metastore) and there is a proposal to add another (Owl). 
Pig needs to be able to query these metadata systems for information about data to be read. It also needs to be able to record information to these metadata systems when writing data. The load and store functions are a reasonable place to do these operations since that is the point at which Pig is reading and writing data. This will also allow Pig to read and write data from and to multiple metadata stores in single Pig Latin scripts if that is desired. A requirement for the implementation that does not fit into the goals above is that while the existing Pig implementation is tightly tied to Hadoop (and is becoming more tightly tied all the time), we do not want to tie Pig Latin tightly to Hadoop. Therefore while we plan to allow users to easily interact with Hadoop !InputFormats and !OutputFormats, these should not be exposed as such to Pig Latin. Pig Latin must still view these as load and store functions; it will only be the underlying implementation that will realize that they are Hadoop classes and handle them appropriately. == Interfaces == With these proposed changes, load and store functions in Pig are becoming very weighty objects. The current !LoadFunc interface already provides mechanisms for reading the data, getting some schema information, casting data, and some place holders for pushing down projections into the loader. This proposal will add more file level metadata, global metadata, selection push down, plus interaction with !InputFormats. It will also add !OutputFormats to store functions. If we create two monster interfaces that attempt to provide everything, the burden of creating a new load or store function in Pig will become overwhelming. Instead, this proposal envisions splitting the interface into a number of interfaces, each with a clear responsibility. Load and store functions will then only be required to implement the interfaces for functionality they offer. 
For load functions: * !LoadFunc will be pared down to just contain functions directly associated with reading data, such as getNext. * A new !LoadCaster interface will be added. This interface will contain all of the bytesToX methods currently in !LoadFunc. !LoadFunc will add a getCaster routine, that will return an object that can provide casts. The existing UTF8!StorageConverter class will change to implement this interface. Load functions will then be free to use this class as their caster, or provide their own. For existing load functions that provide all of the bytesToX methods, they can implement the !LoadCaster interface and return themselves from the getCaster routine. If a loader does not provide a !LoadCaster, casts from byte array to other pig types will not be supported for data loaded via that loader. * A new !LoadMetadata interface will be added. Calls that find metadata about the data being loaded, such as determineSchema, will be placed in this interface. If a loader does not im
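The getCaster split described above can be sketched in a few lines. This is an illustrative stand-in, not the real Pig API: the interface and class names are simplified, and only two of the many bytesToX methods are shown. A loader that already provides the conversions implements the caster interface and returns itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the proposed LoadCaster split; the real
// interface would carry the full set of bytesToX methods from LoadFunc.
interface LoadCaster {
    Integer bytesToInteger(byte[] b) throws IOException;
    String bytesToCharArray(byte[] b) throws IOException;
}

// A load function that already provides the bytesToX methods can
// implement LoadCaster itself and return `this` from getCaster.
class SimpleTextLoader implements LoadCaster {

    LoadCaster getCaster() {
        return this; // loader doubles as its own caster
    }

    public Integer bytesToInteger(byte[] b) throws IOException {
        try {
            return Integer.valueOf(new String(b, StandardCharsets.UTF_8).trim());
        } catch (NumberFormatException e) {
            throw new IOException("cannot cast bytes to integer", e);
        }
    }

    public String bytesToCharArray(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

A loader that does not implement the caster interface would simply return null (or a shared UTF8 caster) from getCaster, and byte-array casts for its data would be unsupported, as the proposal states.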
[Pig Wiki] Update of "MetadataInterfaceProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/MetadataInterfaceProposal New page: = Proposed Design for Pig Metadata Interface = With the introduction of SQL, Pig needs to be able to communicate with external metadata services. These communications will includes such operations as creating, altering, and dropping databases, tables, etc. It will also include metadata queries, such as requests to show available tables, etc. DDL operations of these sorts will be beyond the scope of the proposed metadata interfaces for load and storage functions. However, Pig should not be tightly tied to a single metadata implementation. It should be able to work with Owl, Hive's metastore, or any other metadata source that is added to Hadoop. To this end this document proposes an interface for operating with metadata systems. Different metadata connectors can then be implemented, one for each metadata system. == Interface == This interface will allow users to find information about tables, databases, etc. in the metadata store. For each call, it will pass the portion of the syntax tree relavant to the operation to the metadata connector. These structures will be versioned. {{{ /** * An interface to encapsulate DDL operations. 
*/ interface MetadataDDL { void createTable(CreateTable ct) throws IOException; void alterTable(AlterTable at) throws IOException; // includes add and drop partition void dropTable(DropTable dt) throws IOException; SQLTable[] showTables(Database db) throws IOException; // info returned in SQLTable includes info on partitions void createDatabase(CreateDatabase cd) throws IOException; void alterDatabase(AlterDatabase ad) throws IOException; void dropDatabase(DropDatabase dd) throws IOException; SQLDatabase[] showDatabases() throws IOException; } }}} == Accessing Global Metadata From SQL == Pig will be configured to work with one global metadata source for a given set of SQL operations. This configuration will be via Pig's configuration file. It will specify the URI of the server to use and the implementation of !MetadataDDL to use with this server. == Accessing Global Metadata from Pig Latin == Pig Latin will not support a call to metadata within the language itself. Instead, it will support the ability to invoke a SQL DDL command. This SQL will then be sent to the SQL parser and dispatched through the metadata service as before. {{{ A = load ... ... SQL {"create table myTable ..."}; store Z into 'myTable' using OwlStorage(); }}}
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal -- 1. The Slicer interface is redundant. Remove it and allow users to directly use Hadoop !InputFormats in Pig. 1. It is not currently easy to use a separate !OutputFormat for a !StoreFunc. This should be made easy to allow users to store data into locations other than HDFS. - 1. Currently users that wish to operate on Pig and Map-Reduce are required to write Hadoop !InputFormat and !OutpuFormat as well as a Pig loader and Pig storage functions. While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost. + 1. Currently users that wish to operate on Pig and Map-Reduce are required to write Hadoop !InputFormat and !OutputFormat as well as a Pig load and storage functions. While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost. - 1. The major difference between a Hadoop !InputFormat and a Pig load function is the data model. Hadoop views data as key-value pairs, Pig as a tuple. Similarly for !OutputFormat and store functions. 1. New storage formats such as Zebra are being implemented for Hadoop that include metadata information such as schema, etc. The !LoadFunc interface needs to allow Pig to obtain this metadata. There is a describeSchema call in the current interface. More functions may be necessary. 1. These new storage formats also plan to support pushing of, at least, projection and selection into the storage layer. 
Pig needs to be able to query loaders to determine what if any pushdown capabilities they support and then make use of those capabilities. 1. There already exists one metadata system in Hadoop (Hive's metastore) and there is a proposal to add another (Owl). Pig needs to be able to query these metadata systems for information about data to be read. It also needs to be able to record information to these metadata systems when writing data. The load and store functions are a reasonable place to do these operations since that is the point at which Pig is reading and writing data. This will also allow Pig to read and write data from and to multiple metadata stores in single Pig Latin scripts if that is desired. @@ -22, +21 @@ == Interfaces == With these proposed changes, load and store functions in Pig are becoming very weighty objects. The current !LoadFunc interface already provides mechanisms for reading the data, getting some schema information, casting data, and some place holders for pushing down projections into - the loader. This proposal will add more file level metadata, global metadata, selection push down, plus interaction with !InputFormats. It will + the loader. This proposal will add more file level metadata, selection push down, plus interaction with !InputFormats. It will also add !OutputFormats to store functions. If we create two monster interfaces that attempt to provide everything, the burden of creating a new load or store function in Pig will become overwhelming. Instead, this proposal envisions splitting the interface into a number of interfaces, each with a clear responsibility. Load and store functions will then only be required to implement the interfaces for functionality they offer. For load functions: - * !LoadFunc will be pared down to just contain functions directly associated with reading data, such as getNext. + * '''!LoadFunc''' will be pared down to just contain functions directly associated with reading data, such as getNext. 
- * A new !LoadCaster interface will be added. This interface will contain all of the bytesToX methods currently in !LoadFunc. !LoadFunc will add a getCaster routine, that will return an object that can provide casts. The existing UTF8!StorageConverter class will change to implement this interface. Load functions will then be free to use this class as their caster, or provide their own. For existing load functions that provide all of the bytesToX methods, they can implement the !LoadCaster interface and return themselves from the getCaster routine. If a loader does not provide a !LoadCaster, casts from byte array to other pig types will not be supported for data loaded via that loader. + * A new '''!LoadCaster''' interface will be added. This interface will contain all of the bytesToX methods currently in !LoadFunc. !LoadFunc will add a `getCaster` routine, that will return an object that
[Pig Wiki] Update of "MetadataInterfaceProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/MetadataInterfaceProposal -- = Proposed Design for Pig Metadata Interface = - With the introduction of SQL, Pig needs to be able to communicate with external metadata services. These communications will includes such + With the introduction of SQL, Pig needs to be able to communicate with external metadata services. These communications will include such operations as creating, altering, and dropping databases, tables, etc. It will also include metadata queries, such as requests to show available tables, etc. DDL operations of these sorts will be beyond the scope of the proposed metadata interfaces for load and storage functions. However, Pig should not be tightly tied to a single metadata implementation. It should be able to - work with Owl, Hive's metastore, or any other metadata source that is added to Hadoop. To this end this document proposes an interface for + work with Owl, Hive's metastore, or any other metadata source that is added to Hadoop. To this end, this document proposes an interface for operating with metadata systems. Different metadata connectors can then be implemented, one for each metadata system. == Interface == - This interface will allow users to find information about tables, databases, etc. in the metadata store. For each call, it will pass the portion of the syntax tree relavant to the operation to the metadata connector. These structures will be versioned. @@ -39, +38 @@ configuration file. It will specify the URI of the server to use and the implementation of !MetadataDDL to use with this server. == Accessing Global Metadata from Pig Latin == - Pig Latin will not support a call to metadata within the language itself. Instead, it will support the ability to invoke a SQL DDL command. 
+ Pig Latin will not support a call to metadata within the language itself. Instead, it will support the ability to invoke a Pig SQL DDL command. This SQL will then be sent to the SQL parser and dispatched through the metadata service as before. {{{
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification -- ||2197||RelationalOperator: Cannot drop column which require *|| ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| + ||2200||PruneColumns: Error getting top level project|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=3&rev2=4 * @param reader RecordReader to be used by this instance of the LoadFunc */ void prepareToRead(RecordReader reader); + + /** + * Called after all reading is finished. + */ + void doneReading(); /** * Retrieves the next tuple to be processed. @@ -289, +294 @@ void prepareToWrite(RecordWriter writer); /** + * Called when all writing is finished. + */ + void doneWriting(); + + /** * Write a tuple the output stream to which this instance was * previously bound. *
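The read-side lifecycle with the newly added doneReading call can be sketched as follows. All names are illustrative stand-ins: an Iterator takes the place of Hadoop's RecordReader so the sketch stays self-contained.

```java
import java.util.Iterator;

// Sketch of the read lifecycle with the new doneReading call:
// prepareToRead, then getNext until it returns null, then doneReading.
// An Iterator stands in for Hadoop's RecordReader.
class SketchLoadFunc {
    private Iterator<String> reader;
    boolean readingFinished = false;

    void prepareToRead(Iterator<String> reader) {
        this.reader = reader;
    }

    String getNext() {
        return reader.hasNext() ? reader.next() : null;
    }

    // Called after all reading is finished, e.g. to release resources.
    void doneReading() {
        readingFinished = true;
    }
}
```

The write side mirrors this: prepareToWrite binds a RecordWriter, and the new doneWriting call closes out the task once all tuples are written.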
[Pig Wiki] Update of "ProposedRoadMap" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "ProposedRoadMap" page has been changed by AlanGates: http://wiki.apache.org/pig/ProposedRoadMap?action=diff&rev1=3&rev2=4 - <> = Pig Road Map = The following document was developed as a roadmap for pig at Yahoo prior to pig being released as open source.
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=4&rev2=5 interface LoadFunc { /** - * Communicate to the loader the URIs used in Pig Latin to refer to the + * Communicate to the loader the load string used in Pig Latin to refer to the * object(s) being loaded. That is, if the PL script is * A = load 'bla' - * then 'bla' is the URI. Load functions should assume that if no - * scheme is provided in the URI it is an hdfs file. This will be + * then 'bla' is the load string. In general Pig expects these to be + * a path name, a glob, or a URI. If there is no URI scheme present, + * Pig will assume it is a file name. This will be * called during planning on the front end, not during execution on * the backend. - * @param uri URIs referenced in load statement. + * @param location Location indicated in load statement. + * @throws IOException if the location is not valid. */ - void setURI(URI[] uri); + void setLocation(String location) throws IOException; /** * Return the InputFormat associated with this loader. This will be * called during planning on the front end. The LoadFunc need not * carry the InputFormat information to the backend, as it will - * be provided with the appropriate RecordReader there. + * be provided with the appropriate RecordReader there. This is the + * instance of InputFormat (rather than the class name) because the + * load function may need to instantiate the InputFormat in order + * to control how it is constructed. */ InputFormat getInputFormat(); @@ -77, +82 @@ /** * Initializes LoadFunc for reading data. This will be called during execution - * before any calls to getNext. + * before any calls to getNext. The RecordReader needs to be passed here because + * it has been instantiated for a particular InputSplit. 
* @param reader RecordReader to be used by this instance of the LoadFunc */ void prepareToRead(RecordReader reader); @@ -100, +106 @@ }}} Open questions for !LoadFunc: - 1. Should setURI instead be setLocation and just take a String? The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given). The disadvantage is forcing more structure on users and their load functions. I'm still pretty strongly on the side of using URI. + 1. Should setLocation instead be setURI and take a URI? The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given). The disadvantage is forcing more structure on users and their load functions. The '''!LoadCaster''' interface will include bytesToInt, bytesToLong, etc. functions currently in !LoadFunc. UTF8!StorageConverter will implement this interface. @@ -121, +127 @@ * not possible to return a schema that represents all returned data, * then null should be returned. */ - LoadSchema getSchema(); + ResourceSchema getSchema(); /** * Get statistics about the data to be loaded. If no statistics are * available, then null should be returned. */ - LoadStatistics getStatistics(); + ResourceStatistics getStatistics(); + + /** + * Find what columns are partition keys for this input. + * This function assumes that setLocation has already been called. + * @return array of field names of the partition keys. + */ + String[] getPartitionKeys(); + + /** + * Set the filter for partitioning. It is assumed that this filter + * will only contain references to fields given as partition keys in + * getPartitionKeys + * @param plan that describes filter for partitioning + * @throws IOException if the filter is not compatible with the storage + * mechanism or contains non-partition fields. 
+ */ + void setPartitionFilter(OperatorPlan plan) throws IOException; } }}} - '''!LoadSchema''' will be a top level object (`org.apache.pig.LoadSchema`) used to communicate information about data to be loaded or that is being + '''!ResourceSchema''' will be a top level object (`org.apache.pig.ResourceSchema`) used to communicate information about data to be loaded or that is being stored. It is not the same as the existing `org.apache.pig.impl.logicalLayer.schema.Schema`. {{{ - public class LoadSchema { + public class ResourceSchema { int version; - public class LoadFieldSchema { + public class ResourceFieldSchema
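The new partition calls on !LoadMetadata can be illustrated with a small stand-in. The key names are hypothetical, and the real setPartitionFilter receives an OperatorPlan rather than a list of referenced field names; only the accept/reject contract is sketched here.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Stand-in for the proposed LoadMetadata partition calls. The keys are
// hypothetical; the real setPartitionFilter receives an OperatorPlan
// rather than a list of referenced field names.
class SketchLoadMetadata {
    // Pretend setLocation was already called and told us the keys.
    private final List<String> partitionKeys = Arrays.asList("date", "region");

    String[] getPartitionKeys() {
        return partitionKeys.toArray(new String[0]);
    }

    // The proposal requires the filter to reference only partition keys;
    // anything else is rejected with an IOException.
    void setPartitionFilter(List<String> referencedFields) throws IOException {
        for (String f : referencedFields) {
            if (!partitionKeys.contains(f)) {
                throw new IOException("non-partition field in filter: " + f);
            }
        }
    }
}
```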
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=122&rev2=123 ||2182||Prune column optimization: Only relational operator can be used in column prune optimization.|| ||2183||Prune column optimization: LOLoad must be the root logical operator.|| ||2184||Prune column optimization: Fields list inside RequiredFields is null.|| - ||2185||Prune column optimization: Unable to prune columns when processing node|| ||2186||Prune column optimization: Cannot locate node from successor|| ||2187||Column pruner: Cannot get predessors|| ||2188||Column pruner: Cannot prune columns||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=5&rev2=6 void prepareToWrite(RecordWriter writer); /** - * Called when all writing is finished. + * Called when all writing is finished. This will be called on the backend, + * once for each writing task. */ void doneWriting(); @@ -330, +331 @@ * @throws IOException */ void putNext(Tuple t) throws IOException; + + /** + * Called when writing all of the data is finished. This can be used + * to commit information to a metadata system, clean up tmp files, + * close connections, etc. This call will be made on the front end + * after all back end processing is finished. + */ + void allFinished(); + + } @@ -461, +472 @@ == Changes == Sept 23 2009, Gates * Changed setURI to setLocation in !LoadFunc and !StoreFunc. Also changed it to throw IOException in the cases where the passed in location is not valid for this load or store mechanism. - * Changed LoadSchema to ResourceSchema and LoadStatistics to ResourceStatistics + * Changed !LoadSchema to !ResourceSchema and !LoadStatistics to !ResourceStatistics - * Added getPartitionKeys and setPartitionFilter to LoadMetadata + * Added getPartitionKeys and setPartitionFilter to !LoadMetadata + Sept 25 2009, Gates + * Added allFinished call to !StoreFunc +
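The distinction between the per-task doneWriting call and the front-end allFinished call can be sketched as below. All names are illustrative, strings stand in for tuples, and the driver loop merely simulates the backend tasks that Hadoop would actually run.

```java
import java.util.List;

// Sketch of the write lifecycle: doneWriting runs once per backend
// writing task, while allFinished runs once on the front end after all
// tasks complete (e.g. to commit metadata or clean up tmp files).
class SketchStoreFunc {
    int tuplesWritten = 0;
    int doneWritingCalls = 0;
    boolean allFinishedCalled = false;

    void putNext(String tuple) { tuplesWritten++; }

    void doneWriting() { doneWritingCalls++; }       // backend, per task

    void allFinished() { allFinishedCalled = true; } // front end, once
}

class StoreLifecycleDemo {
    // Simulate several writing tasks followed by the front-end commit.
    static SketchStoreFunc run(List<List<String>> tasks) {
        SketchStoreFunc store = new SketchStoreFunc();
        for (List<String> task : tasks) {
            for (String t : task) {
                store.putNext(t);
            }
            store.doneWriting();
        }
        store.allFinished();
        return store;
    }
}
```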
[Pig Wiki] Update of "MetadataInterfaceProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "MetadataInterfaceProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/MetadataInterfaceProposal?action=diff&rev1=3&rev2=4 void alterDatabase(AlterDatabase ad) throws IOException; void dropDatabase(DropDatabase dd) throws IOException; SQLDatabase[] showDatabases() throws IOException; + + /** + * Get the default load function for this metadata service. This + * will be called by SQL to determine the right load function for + * the metadata service it is connected to. + * @return class name of the default load function for this interface. + */ + String getLoaderClass(); + + /** + * Get the default storage function for this metadata service. This + * will be called by SQL to determine the right storage function for + * the metadata service it is connected to. + * @return class name of the default storage function for this interface. + */ + String getStorageClass(); + + } @@ -48, +66 @@ store Z into 'myTable' using OwlStorage(); }}} + == Changes == + September 25 2009 + * Added getLoaderClass and getStorageClass to interface, Gates. +
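One way the SQL layer might consume getLoaderClass, sketched with entirely hypothetical connector and loader classes. The reflection step is an assumption on our part; the proposal only specifies that class names are returned.

```java
// Hypothetical sketch of how SQL could use getLoaderClass: the metadata
// connector hands back a class name, and Pig instantiates the load
// function by reflection, staying decoupled from any one metadata
// service. All class names here are made up for illustration.
interface MetadataDDLSketch {
    String getLoaderClass();
    String getStorageClass();
}

class OwlConnectorSketch implements MetadataDDLSketch {
    public String getLoaderClass()  { return "OwlLoaderSketch"; }
    public String getStorageClass() { return "OwlStorageSketch"; }
}

class OwlLoaderSketch { }   // stand-in for the real load function
class OwlStorageSketch { }  // stand-in for the real store function

class LoaderFactory {
    // Resolve and instantiate the connector's default load function.
    static Object loaderFor(MetadataDDLSketch md) throws Exception {
        return Class.forName(md.getLoaderClass())
                    .getDeclaredConstructor()
                    .newInstance();
    }
}
```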
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=6&rev2=7 type conversion on this data will be done in the same way as noted above for !InputFormatLoader. Open Questions: - 1. Does all this force us to switch to Hadoop for local mode as well? We aren't opposed to using Hadoop for local mode it just needs to get reasonable fast. Can we use !InputFormat ''et. al.'' on local files without using the whole HDFS structure? + 1. Does all this force us to switch to Hadoop for local mode as well? We aren't opposed to using Hadoop for local mode; it just needs to get reasonably fast. Can we use !InputFormat ''et al.'' on local files without using the whole HDFS structure? '''Answer''' According to the Hadoop documentation, !TextInputFormat works on local files as well as hdfs files. We may need to detect that we are in local mode and change the filename to `file://` + 1. How will we work with compressed files? !FileInputFormat already works with bzip and gzip compressed files, producing reasonable splits. !PigStorage will be reworked to depend on !FileInputFormat (or a descendant thereof, see next item) and should therefore be able to use this functionality. + 1. How will the need for mark and seek in index construction for merge join be handled? In the long term we'd like Hadoop to handle this for us by creating a !SeekableInputFormat that would add this functionality. In the meantime we can extend !FileInputFormat to !PigFileInputFormat. We can add a getPos() call to this class that will provide a position to start reading at to find the tuple being indexed. Note that this position will not necessarily be the exact position of the tuple, but a position from which the tuple can be found. 
We can also change the getSplits call on this class to return a split that is specific to a given position so that it can be used during the join. == Changes == Sept 23 2009, Gates @@ -478, +480 @@ Sept 25 2009, Gates * Added allFinished call to !StoreFunc + Sept 29 2009, Gates + * Added answer for open question 1. Added and answered open questions 2 and 3. +
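The answer to the first open question suggests rewriting a plain filename to a `file://` URI when running in local mode. A minimal sketch of that rewrite, assuming a simple scheme check is sufficient (the class and method names are invented):

```java
// Sketch (assumption): in local mode a plain path is rewritten to a
// file:// URI so InputFormat-based code can address local files.
class LocalPathSketch {
    static String toLocalUri(String path) {
        // leave fully-qualified locations (hdfs://, file://, ...) alone
        if (path.contains("://")) return path;
        return "file://" + path;
    }
}
```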
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=123&rev2=124 ||1103||Merge join only supports Filter, Foreach and Load as its predecessor. Found : || ||1104||Right input of merge-join must implement SamplableLoader interface. This loader doesn't implement it.|| ||1105||Heap percentage / Conversion factor cannot be set to 0 || + ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=124&rev2=125 ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| + ||2201||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=125&rev2=126 ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| - ||2201||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin|| @@ -420, +419 @@ ||4007||Missing from hadoop configuration|| ||4008||Failed to create local hadoop file || ||4009||Failed to copy data to local hadoop file || + ||4010||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||6000||The output file(s): already exists|| ||6001||Cannot read from the storage where the output will be stored|| ||6002||Unable to obtain a temporary path.||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=126&rev2=127 ||1104||Right input of merge-join must implement SamplableLoader interface. This loader doesn't implement it.|| ||1105||Heap percentage / Conversion factor cannot be set to 0 || ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || + ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| @@ -419, +420 @@ ||4007||Missing from hadoop configuration|| ||4008||Failed to create local hadoop file || ||4009||Failed to copy data to local hadoop file || - ||4010||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||6000||The output file(s): already exists|| ||6001||Cannot read from the storage where the output will be stored|| ||6002||Unable to obtain a temporary path.||
[Pig Wiki] Update of "FrontPage" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by yinghe: http://wiki.apache.org/pig/FrontPage?action=diff&rev1=142&rev2=143 * PigErrorHandling * PigMultiQueryPerformanceSpecification * PigSkewedJoinSpec + * PigAccumulatorSpec * PigSampler * Performance * PigPerformance (current performance numbers)
[Pig Wiki] Update of "FrontPage" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by yinghe: http://wiki.apache.org/pig/FrontPage?action=diff&rev1=143&rev2=144 * PigErrorHandling * PigMultiQueryPerformanceSpecification * PigSkewedJoinSpec - * PigAccumulatorSpec + * PigAccumulatorUDF * PigSampler * Performance * PigPerformance (current performance numbers)
[Pig Wiki] Update of "FrontPage" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by yinghe: http://wiki.apache.org/pig/FrontPage?action=diff&rev1=144&rev2=145 * PigErrorHandling * PigMultiQueryPerformanceSpecification * PigSkewedJoinSpec - * PigAccumulatorUDF + * PigAccumulatorSpec * PigSampler * Performance * PigPerformance (current performance numbers)
[Pig Wiki] Update of "MarkMeissonnier" by MarkMeissonnier
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "MarkMeissonnier" page has been changed by MarkMeissonnier: http://wiki.apache.org/pig/MarkMeissonnier New page: #format wiki #language en == Mark Meissonnier == I am a software engineer who arrived in Silicon Valley in March 2000 (which, for the anecdote, was one month after the Nasdaq hit its all-time high of 5000 points... What is it today?) CategoryHomepage
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=127&rev2=128 ||1105||Heap percentage / Conversion factor cannot be set to 0 || ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| + ||1108||Duplicated schema|| - ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || + ||20008||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| ||2003||Cannot read from the storage where the output will be stored||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=128&rev2=129 ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| - ||20008||Internal error. Mismatch in group by arities. Expected: . Found: || + ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| ||2003||Cannot read from the storage where the output will be stored||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy: http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=129&rev2=130 ||2198||LOLoad: load only take 1 input|| ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| + ||2201||Could not validate schema alias|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath: http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=7&rev2=8 * to commit information to a metadata system, clean up tmp files, * close connections, etc. This call will be made on the front end * after all back end processing is finished. + * @param conf The job configuration */ - void allFinished(); + void allFinished(Configuration conf);
[Pig Wiki] Update of "PigMix" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by AlanGates. http://wiki.apache.org/pig/PigMix?action=diff&rev1=11&rev2=12 -- || PigMix_12 || 156 || 160.67 || 0.97 || || Total || 2440.67 || 2001.67 || 1.22 || + Run date: October 18, 2009, run against top of trunk as of that day. + With this run we included a new measure, weighted average. The multiplier we have been publishing takes the total time of running all 12 Pig Latin scripts and compares it to the total time of + running all 12 Java Map Reduce programs. This is a valid way to measure, as it shows the total amount of time to do all these operations on both platforms. But it has the drawback that it gives more weight to + long-running operations (such as joins and order bys) while masking the performance in faster operations such as group bys. The new "weighted average" adds up the multiplier for each Pig Latin script vs. Java + program separately and then divides by 12, thus weighting each test equally. In past runs the weighted average had significantly lagged the overall average (for example, in the run above for August 27 it + was 1.5 even though the total difference was 1.2). With this latest run it still lags some, but the gap has shrunk noticeably. + + || Test || Pig run time || Java run time || Multiplier || + || PigMix_1 || 135.0|| 133.0 || 1.02 || + || PigMix_2 || 46.67|| 39.33 || 1.19 || + || PigMix_3 || 184.0|| 98.0 || 1.88 || + || PigMix_4 || 71.67|| 77.67 || 0.92 || + || PigMix_5 || 70.0 || 83.0 || 0.84 || + || PigMix_6 || 76.67|| 61.0 || 1.26 || + || PigMix_7 || 71.67|| 61.0 || 1.17 || + || PigMix_8 || 43.33|| 47.67 || 0.91 || + || PigMix_9 || 184.0|| 209.33|| 0.88 || + || PigMix_10 || 268.67 || 283.0 || 0.95 || + || PigMix_11 || 145.33 || 168.67|| 0.86 || + || PigMix_12 || 55.33|| 95.33 || 0.58 || + || Total || 1352.33 || 1357 || 1.00 || + Weighted Average: 1.04 == Features Tested ==
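The two measures described above can be recomputed from the raw times in the table. A small sketch (times copied from this run; the total multiplier divides the summed runtimes, while the weighted average gives each of the 12 scripts equal weight):

```java
// Sketch: recomputing the two PigMix metrics from the run's raw times.
class PigMixMetrics {
    static final double[] PIG  = {135.0, 46.67, 184.0, 71.67, 70.0, 76.67,
                                  71.67, 43.33, 184.0, 268.67, 145.33, 55.33};
    static final double[] JAVA = {133.0, 39.33, 98.0, 77.67, 83.0, 61.0,
                                  61.0, 47.67, 209.33, 283.0, 168.67, 95.33};

    // total Pig time / total Java time (the previously published number)
    static double totalMultiplier() {
        double p = 0, j = 0;
        for (int i = 0; i < PIG.length; i++) { p += PIG[i]; j += JAVA[i]; }
        return p / j;
    }

    // mean of the per-script ratios, so each test counts equally
    static double weightedAverage() {
        double sum = 0;
        for (int i = 0; i < PIG.length; i++) sum += PIG[i] / JAVA[i];
        return sum / PIG.length;
    }
}
```

This reproduces the published values of 1.00 (total) and 1.04 (weighted average) to two decimals.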
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=8&rev2=9 -- '''!LoadFunc''' {{{ + /** * This interface is used to implement functions to parse records * from a dataset. */ - interface LoadFunc { + public interface LoadFunc { + /** + * This method is called by the Pig runtime in the front end to convert the + * input location to an absolute path if the location is relative. The + * loadFunc implementation is free to choose how it converts a relative + * location to an absolute location since this may depend on what the location + * string represent (hdfs path or some other data source) + * + * @param location location as provided in the "load" statement of the script + * @param curDir the current working direction based on any "cd" statements + * in the script before the "load" statement + * @return the absolute location based on the arguments passed + * @throws IOException if the conversion is not possible + */ + String relativeToAbsolutePath(String location, String curDir) throws IOException; /** * Communicate to the loader the load string used in Pig Latin to refer to the - * object(s) being loaded. That is, if the PL script is - * A = load 'bla' - * then 'bla' is the load string. In general Pig expects these to be - * a path name, a glob, or a URI. If there is no URI scheme present, - * Pig will assume it is a file name. This will be - * called during planning on the front end, not during execution on - * the backend. - * @param location Location indicated in load statement. + * object(s) being loaded. The location string passed to the LoadFunc here + * is the return value of {...@link LoadFunc#relativeToAbsolutePath(String, String)} + * + * This method will be called in the backend multiple times. 
Implementations + * should bear in mind that this method is called multiple times and should + * ensure there are no inconsistent side effects due to the multiple calls. + * + * @param location Location as returned by + * {...@link LoadFunc#relativeToAbsolutePath(String, String)}. + * @param job the {...@link Job} object * @throws IOException if the location is not valid. */ - void setLocation(String location) throws IOException; + void setLocation(String location, Job job) throws IOException; /** + * This will be called during planning on the front end. This is the - * Return the InputFormat associated with this loader. This will be - * called during planning on the front end. The LoadFunc need not - * carry the InputFormat information to the backend, as it will - * be provided with the appropriate RecordReader there. This is the * instance of InputFormat (rather than the class name) because the * load function may need to instantiate the InputFormat in order * to control how it is constructed. + * @return the InputFormat associated with this loader. + * @throws IOException if there is an exception during InputFormat + * construction */ - InputFormat getInputFormat(); + InputFormat getInputFormat() throws IOException; /** + * This will be called on the front end during planning and not on the back + * end during execution. - * Return the LoadCaster associated with this loader. Returning + * @return the {...@link LoadCaster} associated with this loader. Returning null - * null indicates that casts from byte array are not supported + * indicates that casts from byte array are not supported for this loader. - * for this loader. This will be called on the front end during - * planning and not on the back end during execution. + * construction + * @throws IOException if there is an exception during LoadCaster */ - LoadCaster getLoadCaster(); + LoadCaster getLoadCaster() throws IOException; /** * Initializes LoadFunc for reading data. 
This will be called during execution * before any calls to getNext. The RecordReader needs to be passed here because * it has been instantiated for a particular InputSplit. - * @param reader RecordReader to be used by this instance of the LoadFunc + * @param reader {...@link RecordReader} to be used by this instance of the LoadFunc + * @param split The input {...@link PigSplit} to process + * @throws IOException if there is an exception during initialization */ + void prepareToRead(RecordReader reader, PigSplit split) throws IOException; - void prepareToRe
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by ankit.modi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by ankit.modi. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=130&rev2=131 -- ||1106||Merge join is possible only for simple column or '*' join keys when using as the loader || ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| + ||1109||Input ( ) on which outer join is desired should have a valid schema|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists|| @@ -467, +468 @@ 1. February 11, 2009: Updated "Compendium of error messages" to include new error codes (2116 through 2121, 6015 and 6016) 1. February 12, 2009: Updated "Compendium of error messages" to include new error code 2122 1. April 10, 2009: Updated "Compendium of error messages" to replace error code 2110 +1. November 2, 2009: Updated "Compendium of error messages" to include new error code 1109 == References == 1. <> "Pig Developer Cookbook" October 21, 2008, http://wiki.apache.org/pig/PigDeveloperCookbook
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=9&rev2=10 -- * Added relativeToAbsolutePath() method in LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 * Changed comments in setLocation regarding the location passed - the location will now be the return value of relativeToAbsolutePath() * setLocation() now also takes a Job argument since the main purpose of this call is to give the LoadFunc implementation an opportunity to communicate the input location to the underlying InputFormat. InputFormat implementations in turn seem to store this information in the Job. For example, FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths); + * Removed doneReading() method since there is already a RecordReader.close() method which will be called by Hadoop wherein all the functionality that needs to be done on completion of reading can be done. * All methods can now throw IOException - this keeps the interface more flexible for exception cases In LoadMetadata:
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=10&rev2=11 -- * @return the absolute location based on the arguments passed * @throws IOException if the conversion is not possible */ - String relativeToAbsolutePath(String location, String curDir) throws IOException; + String relativeToAbsolutePath(String location, Path curDir) throws IOException; /** * Communicate to the loader the load string used in Pig Latin to refer to the
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=11&rev2=12 -- * * @param location location as provided in the "load" statement of the script * @param curDir the current working direction based on any "cd" statements - * in the script before the "load" statement + * in the script before the "load" statement. If there are no "cd" statements + * in the script, this would be the home directory - + * /user/ * @return the absolute location based on the arguments passed * @throws IOException if the conversion is not possible */ String relativeToAbsolutePath(String location, Path curDir) throws IOException; + /** * Communicate to the loader the load string used in Pig Latin to refer to the
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=12&rev2=13 -- '''!StoreFunc''' {{{ + + /** + * This interface is used to implement functions to write records + * from a dataset. + */ + public interface StoreFunc { + + /** + * This method is called by the Pig runtime in the front end to convert the + * output location to an absolute path if the location is relative. The + * StoreFunc implementation is free to choose how it converts a relative + * location to an absolute location since this may depend on what the location + * string represent (hdfs path or some other data source) + * + * @param location location as provided in the "store" statement of the script + * @param curDir the current working direction based on any "cd" statements + * in the script before the "store" statement. If there are no "cd" statements + * in the script, this would be the home directory - + * /user/ + * @return the absolute location based on the arguments passed + * @throws IOException if the conversion is not possible + */ + String relToAbsPathForStoreLocation(String location, Path curDir) throws IOException; /** * Return the OutputFormat associated with StoreFunc. This will be called * on the front end during planning and not on the backend during - * execution. OutputFormat information need not be carried to the back end - * as the appropriate RecordWriter will be provided to the StoreFunc. + * execution. 
+ * @return the {...@link OutputFormat} associated with StoreFunc + * @throws IOException if an exception occurs while constructing the + * OutputFormat - */ + * + */ - OutputFormat getOutputFormat(); + OutputFormat getOutputFormat() throws IOException; /** * Communicate to the store function the location used in Pig Latin to refer @@ -327, +353 @@ * called during planning on the front end, not during execution on * the backend. * @param location Location indicated in store statement. + * @param job The {...@link Job} object * @throws IOException if the location is not valid. */ - void setLocation(String location) throws IOException; + void setStoreLocation(String location, Job job) throws IOException; /** * Set the schema for data to be stored. This will be called on the + * front end during planning. A Store function should implement this function to - * front end during planning. If the store function wishes to record - * the schema it will need to carry it to the backend. - * Even if a store function cannot - * record the schema, it may need to implement this function to * check that a given schema is acceptable to it. For example, it * can check that the correct partition keys are included; * a storage function to be written directly to an OutputFormat can * make sure the schema will translate in a well defined way. - * @param schema to be checked/set + * @param s to be checked - * @throw IOException if this schema is not acceptable. It should include + * @throws IOException if this schema is not acceptable. It should include * a detailed error message indicating what is wrong with the schema. */ - void setSchema(ResourceSchema s) throws IOException; + void checkSchema(ResourceSchema s) throws IOException; /** * Initialize StoreFunc to write data. This will be called during * execution before the call to putNext. * @param writer RecordWriter to use. 
+ * @throws IOException if an exception occurs during initialization */ - void prepareToWrite(RecordWriter writer); + void prepareToWrite(RecordWriter writer) throws IOException; - - /** - * Called when all writing is finished. This will be called on the backend, - * once for each writing task. - */ - void doneWriting(); /** * Write a tuple the output stream to which this instance was * previously bound. * - * @param f the tuple to store. + * @param t the tuple to store. - * @throws IOException + * @throws IOException if an exception occurs during the write */ void putNext(Tuple t) throws IOException; - - /** - * Called when writing all of the data is finished. This can be used - * to commit information to a metadata system, clean up tmp files, - * close connections, etc. This call will be made on the front end - * after all back end processing is finished. - * @param conf The job configurati
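checkSchema() above is described as validating a schema rather than recording it, for example checking that the correct partition keys are included. A hedged sketch of that kind of check (the real method signals rejection by throwing IOException; this sketch returns a boolean to keep the example minimal, and the field names are invented):

```java
import java.util.Set;

// Sketch (assumption): the kind of validation a checkSchema()
// implementation might perform for a partitioned store.
class SchemaCheckSketch {
    static boolean isSchemaAcceptable(Set<String> schemaFields,
                                      Set<String> partitionKeys) {
        // every partition key must appear as a field in the schema
        return schemaFields.containsAll(partitionKeys);
    }
}
```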
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=13&rev2=14 -- Nov 2 2009, Pradeep Kamath - In LoadFunc: + In !LoadFunc: - * Added relativeToAbsolutePath() method in LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 + * Added relativeToAbsolutePath() method in !LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 * Changed comments in setLocation regarding the location passed - the location will now be the return value of relativeToAbsolutePath() - * setLocation() now also takes a Job argument since the main purpose of this call is to an opportunity to the LoadFunc implementation to communicate the input location to underlying InputFormat. InputFormat implementations inturn seem to be storing this information inthe Job. For example, FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths) ; + * setLocation() now also takes a Job argument since the main purpose of this call is to an opportunity to the !LoadFunc implementation to communicate the input location to underlying !InputFormat. !InputFormat implementations inturn seem to be storing this information inthe Job. For example, !FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths) ; * Removed doneReading() method since there is already a RecordReader.close() method which will be called by Hadoop wherein all the functionality that needs to be done on completion of reading can be done. 
* All methods now can throw IOException - this keeps the interface more flexible for exception cases - In LoadMetadata: + In !LoadMetadata: * getSchema(), getStatistics() and getPartitionKeys() methods now take a location and Configuration argument so that the implementation can use that information in returning the information requested. - In StoreFunc: + In !StoreFunc: - * Added relativeToAbsolutePath() method per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 + * Added relToAbsPathForStoreLocation() method per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818 * Methods which did not throw IOException now do so to enable exceptions in implementations - * Removed doneWriting() - same functionality already present in RecordWriter.close() and OutputCommitter.commitTask() + * Removed doneWriting() - same functionality already present in !RecordWriter.close() and !OutputCommitter.commitTask() * Changed setSchema() to checkSchema since this method is called only to allow StoreFunc to check - * Removed allFinished() - same functionality already present in OutputCommitter.cleanupJob() + * Removed allFinished() - same functionality already present in !OutputCommitter.cleanupJob()
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=14&rev2=15 -- }}} - Open questions for !LoadFunc: - 1. Should setLocation instead be setURI and take a URI? The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given). The disadvantage is forcing more structure on users and their load functions. - The '''!LoadCaster''' interface will include bytesToInt, bytesToLong, etc. functions currently in !LoadFunc. UTF8!StorageConverter will implement this interface. + + Open Question: Should the methods to convert to a Bag, Tuple and Map take a Schema (ResourceSchema?) argument? + '''!LoadMetadata''' {{{
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=15&rev2=16 -- The '''!LoadCaster''' interface will include bytesToInt, bytesToLong, etc. functions currently in !LoadFunc. UTF8!StorageConverter will implement this interface. - Open Question: Should the methods to convert to a Bag, Tuple and Map take a Schema (ResourceSchema?) argument? + '''Open Question''': Should the methods to convert to a Bag, Tuple and Map take a Schema (ResourceSchema?) argument? '''!LoadMetadata''' @@ -425, +425 @@ result. Since Pig still needs to add information to !InputSplits, user provided !InputFormats and !InputSplits cannot be used directly. Instead, the - proposal is to change !PigInputFormat to contain an !InputFormat. !PigInputFormat will return !PigInputSplits, each of which contain an + proposal is to change !PigInputFormat to represent the job's !InputFormat to !Hadoop and internally to handle the complexity of multiple inputs and hence multiple !InputFormats. !PigInputFormat will return !PigSplits each of which contain an - !InputSplit. In addition, !PigInputSplit will contain the necessary information to allow Pig to correctly address tuples to the correct data + !InputSplit. In addition, !PigSplit will contain the necessary information to allow Pig to correctly address tuples to the correct data processing pipeline. - In order to support arbitrary Hadoop !InputFormats, it will be necessary to construct a load function, !InputFormatLoader, that will take an + In order to support arbitrary Hadoop !InputFormats, Pig can provide a load function, !InputFormatLoader, that will take an - !InputFormat as a constructor argument. When asked by Pig which !InputFormat to use, it will return the one indicated by the user. Its call to + !InputFormat as a constructor argument. 
Only !InputFormats which have zero argument constructors can be supported since Pig will try to instantiate the supplied !InputFormat using reflection. When asked by Pig which !InputFormat to use, it will return the one indicated by the user. Its call to getNext will then take the key and value provided by the associated !RecordReader and construct a two field tuple. These types will be converted to Pig types as follows: @@ -445, +445 @@ || !BooleanWritable || int|| In the future if Pig exposes boolean as a first class type, this would change to boolean || || !ByteWritable|| int|| || || !NullWritable|| null || || - || All others || byte array || || + || All others || byte array || How do we construct a byte array from arbitrary types? || Since the format of any other types are unknown to Pig and cannot be generalized, it does not make sense to provide casts from byte array to pig types via a !LoadCaster. If users wish to use an !InputFormat that uses types beyond these and cast them to Pig types, they can extend the @@ -469, +469 @@ Positioning information in an !InputSplit presents a problem. Hadoop 0.18 has a getPos call in the !InputSplit, but it has been removed in 0.20. The reason is that input from files can generally be assigned a position, though it may not always be - accurate, as in the bzip case. But some input formats position may not have meaning. Even if Pig does not switch to using !InputFormats it will + accurate, as in the bzip case. But for some input formats position may not have meaning. Even if Pig does not switch to using !InputFormats it will have to deal with this issue, just as MR has. + These changes will affect the !SamplableLoader interface. Currently it uses skip and getPos to move the underlying stream so that it can pick + up a sample of tuples out of a block. Since it would sit atop !InputFormat it would no longer have access to the underlying stream. It would be + changed instead to skip a number of tuples. 
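The !InputFormatLoader idea above - instantiate the user-named !InputFormat by reflection (hence the zero-argument-constructor requirement) and turn each key/value pair handed over by the !RecordReader into a two-field tuple - can be sketched in plain Java. This is only an illustration of the pattern, not Pig code: the class `InputFormatLoaderSketch`, the stub `TextInputFormatStub`, and the use of `List<Object>` in place of Pig's Tuple are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an InputFormatLoader-style load function.
// A plain List<Object> stands in for Pig's Tuple, and TextInputFormatStub
// stands in for a user-supplied InputFormat with a zero-arg constructor.
public class InputFormatLoaderSketch {

    // Stand-in for a user-supplied InputFormat; only its zero-argument
    // constructor matters for this sketch.
    public static class TextInputFormatStub {
        @Override
        public String toString() { return "TextInputFormatStub"; }
    }

    private final Object inputFormat;

    // The loader receives the InputFormat's class name as a constructor
    // argument and instantiates it by reflection -- which is exactly why
    // only zero-argument constructors can be supported.
    public InputFormatLoaderSketch(String inputFormatClassName) {
        try {
            this.inputFormat = Class.forName(inputFormatClassName)
                                    .getDeclaredConstructor()
                                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(
                "InputFormat must have a zero-arg constructor", e);
        }
    }

    // When asked which InputFormat to use, return the user's choice.
    public Object getInputFormat() { return inputFormat; }

    // getNext() takes the key and value provided by the associated
    // RecordReader and constructs a two-field tuple.
    public List<Object> getNext(Object key, Object value) {
        return new ArrayList<>(Arrays.asList(key, value));
    }

    public static void main(String[] args) {
        InputFormatLoaderSketch loader = new InputFormatLoaderSketch(
            "InputFormatLoaderSketch$TextInputFormatStub");
        List<Object> tuple = loader.getNext(0L, "first line of input");
        System.out.println(loader.getInputFormat() + " -> " + tuple);
    }
}
```

In the real proposal the two fields would then be mapped to Pig types per the conversion table above; the sketch stops at tuple construction.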
+ However, in some places Pig needs this position information. In particular, when building an index for a merge join, Pig needs a way to mark a + location in an input while building the index and then return to that position during the join. In this new proposal, the merge join index will contain filename and split index (index of the split in the List returned by InputFormat.getSplits()). The merge join code at run time will then seek to the right split in the file and process from that split on. For this to work th
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=16&rev2=17 -- 1. How will we worked with compressed files? !FileInputFormat already works with bzip and gzip compressed files, producing reasonable splits. !PigStorage will be reworked to depend on !FileInputFormat (or a descendant thereof, see next item) and should therefore be able to use this functionality. Currently Pig supports gz/bzip for arbitrary loadfunc/storefunc combinations. With this proposal, gz/bzip format will only be supported for load/store using PigStorage. - === Implementation details and status === + == Implementation details and status == - Current status + === Current status === A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for PigStorage and BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. - Notes on implementation details + === Notes on implementation details === + This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. 
+ Changes to work with Hadoop !InputFormat model + + Changes to work with Hadoop !OutputFormat model + - Remaining Tasks + === Remaining Tasks === - * BinStorage needs to implement LoadMetadata's getSchema() to replace current determineSchema() + * !BinStorage needs to implement !LoadMetadata's getSchema() to replace current determineSchema() * piggybank loaders/storers need to be ported - * fix lineage code to use LoadCaster instead of LoadFunc + * fix lineage code to use !LoadCaster instead of !LoadFunc * local mode needs to be ported - * PigDump needs to be ported + * !PigDump needs to be ported - * poload needs to be ported + * !POLoad needs to be ported * Need to handle passing loadfunc specific info between different instances of loadfunc (Different instances in front end and between front end and back end - we need what is required in PIG-602) (setPartitionFilter() and pushOperators()for example needs this - these methods are called in the front end but the information passed is needed in the backend) - * For ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with + * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here. * Input/Output handler code in streaming needs to be ported * split by file will have to removed from language * fix code with FIXME in comment relating to load-store redesign - * Decide on what we should do with ReversibleLoadFunc and multiquery optimization + * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=17&rev2=18 -- === Notes on implementation details === This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. - Changes to work with Hadoop !InputFormat model + Changes to work with Hadoop InputFormat model + Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. Changes to work with Hadoop !OutputFormat model @@ -530, +531 @@ * fix lineage code to use !LoadCaster instead of !LoadFunc * local mode needs to be ported * !PigDump needs to be ported - * !POLoad needs to be ported + * POLoad needs to be ported + * Need to handle passing loadfunc specific info between different instances of loadfunc (Different instances in front end and between front end and back end - we need what is required in PIG-602) (setPartitionFilter() and pushOperators()for example needs - * Need to handle passing loadfunc specific info between different instances of loadfunc (Different instances in front end and - between front end and back end - we need what is required in PIG-602) (setPartitionFilter() and pushOperators()for example needs this - these methods are called in the front end but the information passed is needed in the backend) + * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here. 
- * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with - schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here. * Input/Output handler code in streaming needs to be ported * split by file will have to removed from language * fix code with FIXME in comment relating to load-store redesign * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization - - == Changes == Sept 23 2009, Gates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=24&rev2=25 -- * invoke !LoadFunc.setLocation() * Call getInputFormat() on the !LoadFunc and then createRecordReader() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the createRecordReader() call and the createRecordReader() call needs to be given a !TaskAttemptContext built out of the "updated (with location)" Configuration. * Wrap the !RecordReader returned above in !PigRecordReader class which is returned to Hadoop as the !RecordReader. !PigRecordReader has Text as key type (which is always sent with a null value to Hadoop since in pig, we really do not extract a key from input records) and a Tuple as a the value type (which is a tuple constructed from the input record). + + '''Open Question''': - We are hoping that !LoadFunc actually sets up the input location on the conf in the setLocation() call - and then using that conf in createRecordReader() call - what if it does this in getInputFormat()? Changes to work with Hadoop OutputFormat model Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is the class indicated by Pig as the !OutputFormat for map reduce jobs compiled from pig scripts.
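The !PigRecordReader wrapping described above - the key handed to Hadoop is always null because Pig does not extract a key from input records, and the value is a tuple built from the record - can be sketched as follows. Everything here is a stand-in invented for illustration: `List<Object>` plays the part of Pig's Tuple and a plain `Iterator<String>` plays the part of the wrapped Hadoop !RecordReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the PigRecordReader wrapping: the wrapped reader's
// record becomes the Tuple value, while the key returned to Hadoop is
// always null.
public class PigRecordReaderSketch {

    private final Iterator<String> underlying; // stand-in for the wrapped RecordReader
    private List<Object> currentTuple;

    public PigRecordReaderSketch(Iterator<String> underlying) {
        this.underlying = underlying;
    }

    // Advance the underlying reader and build the value tuple from its record.
    public boolean nextKeyValue() {
        if (!underlying.hasNext()) return false;
        currentTuple = new ArrayList<>(Arrays.asList((Object) underlying.next()));
        return true;
    }

    // Pig does not extract a key from input records, so the key is always null.
    public Object getCurrentKey() { return null; }

    public List<Object> getCurrentValue() { return currentTuple; }

    public static void main(String[] args) {
        PigRecordReaderSketch r = new PigRecordReaderSketch(
            Arrays.asList("record one", "record two").iterator());
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentKey() + " -> " + r.getCurrentValue());
        }
    }
}
```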
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=25&rev2=26 -- * split by file will have to removed from language * fix code with FIXME in comment relating to load-store redesign * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization + * Address any '''Open Question'''s in this document == Changes == Sept 23 2009, Gates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=21&rev2=22 -- Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In !PigInputFormat.getSplits(), the implementation processes each input in the following manner: * Instantiate the !LoadFunc associated with the input - * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. + * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputPaths(Job job, String location). We don't want updates to the Configuration for different inputs to over-write each other - hence the clone. 
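The clone-per-input reasoning above can be illustrated with a toy sketch. Here `java.util.Properties` stands in for Hadoop's Configuration, and the key name `mapred.input.dir` merely mimics what !FileInputFormat.setInputPaths() would record; none of this is actual Pig or Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of why PigInputFormat.getSplits() clones the job
// Configuration per input: each input's setLocation() writes into its own
// clone, so inputs in the same job never overwrite each other's settings.
public class CloneConfSketch {

    // One input communicates its location into a *clone* of the job
    // configuration (playing the role of LoadFunc.setLocation()).
    static Properties setLocationOnClone(Properties jobConf, String location) {
        Properties clone = new Properties();
        clone.putAll(jobConf);                           // clone the job conf
        clone.setProperty("mapred.input.dir", location); // record the location
        return clone;
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        List<Properties> perInput = new ArrayList<>();
        // Two inputs of the same map reduce job, e.g. the two sides of a join.
        for (String loc : new String[] {"/data/users", "/data/pages"}) {
            perInput.add(setLocationOnClone(jobConf, loc));
        }
        // Each input sees only its own location; the shared conf is untouched.
        System.out.println(perInput.get(0).getProperty("mapred.input.dir"));
        System.out.println(perInput.get(1).getProperty("mapred.input.dir"));
        System.out.println(jobConf.getProperty("mapred.input.dir"));
    }
}
```

Had both inputs written into the shared `jobConf`, the second location would have clobbered the first - which is the overwrite problem the cloning avoids.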
* Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the getSplits() call and the getSplits() call needs to be given a !JobContext built out of the "updated (with location)" cloned Configuration. * Wrap each returned !InputSplit in !PigSplit to store information like the list of target operators (the pipeline) for this input, the index of the split in the List of Splits returned by getSplits (this is used during merge join index creation) etc (comments in PigSplit explain the members) @@ -535, +535 @@ * Instantiate the !LoadFunc associated with input represented by the PigSplit passed into !PigInputFormat.createRecordReader() * invoke !LoadFunc.setLocation() * Call getInputFormat() on the !LoadFunc and then createRecordReader() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the createRecordReader() call and the createRecordReader() call needs to be given a !TaskAttemptContext built out of the "updated (with location)" Configuration. + * Wrap the !RecordReader returned above in !PigRecordReader class which is returned to Hadoop as the !RecordReader. !PigRecordReader has Text as key type (which is always sent with a null value to Hadoop since in pig, we really do not extract a key from input records) and a Tuple as a the value type (which is a tuple constructed from the input record). Changes to work with Hadoop OutputFormat model + Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is the class indicated by Pig as the !OutputFormat for map reduce jobs compiled from pig scripts. 
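Since Hadoop allows a single !OutputFormat per job while a Pig job may contain several stores, one natural shape for the coordination - the shape this proposal describes for !PigOutputCommitter, which keeps a list of committers and delegates each call to all of them - is a fan-out delegate. The sketch below is hypothetical: `Committer`, `LoggingCommitter`, and `PigStyleCommitter` are invented stand-ins, not Hadoop's OutputCommitter API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the delegation pattern described for
// PigOutputCommitter: one committer fans each lifecycle call out to the
// committers of the underlying OutputFormats, one per store.
public class DelegatingCommitterSketch {

    public interface Committer { void commitTask(String taskId); }

    // Records each call so the delegation is observable in the sketch.
    public static class LoggingCommitter implements Committer {
        private final String name;
        private final List<String> log;
        public LoggingCommitter(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }
        public void commitTask(String taskId) { log.add(name + ":" + taskId); }
    }

    // Delegates every call to all underlying committers.
    public static class PigStyleCommitter implements Committer {
        private final List<Committer> delegates;
        public PigStyleCommitter(List<Committer> delegates) {
            this.delegates = delegates;
        }
        public void commitTask(String taskId) {
            for (Committer c : delegates) c.commitTask(taskId);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Committer> perStore = new ArrayList<>();
        perStore.add(new LoggingCommitter("store1", log));
        perStore.add(new LoggingCommitter("store2", log));
        new PigStyleCommitter(perStore).commitTask("attempt_0");
        System.out.println(log); // one entry per store's committer
    }
}
```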
+ + In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over POStore(s) in the map and reduce phases and for each such store does the following: + * Instantiate the !StoreFunc associated with the POStore + * Make a clone of the JobContext passed in !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary here is because generally in the setStorelocation() method, the StoreFunc would communicate the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call. For example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates to the Configuration for different outputs to over-write each other - hence the clone. + * Call getOutputFormat() on the !StoreFunc and then checkOutputSpecs() on the !OutputFormat returned. Note that the above setStoreLocation call needs to happen *before* the checkOutputSpecs() call and the checkOutputSpecs() call needs to be given the "updated (with location)" cloned JobContext. + === Remaining Tasks === * !BinStorage needs to implement !LoadMetadata's getSchema() to repl
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=22&rev2=23 -- Changes to work with Hadoop OutputFormat model Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is the class indicated by Pig as the !OutputFormat for map reduce jobs compiled from pig scripts. - In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over POStore(s) in the map and reduce phases and for each such store does the following: + In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over !POStore(s) in the map and reduce phases and for each such store does the following: - * Instantiate the !StoreFunc associated with the POStore + * Instantiate the !StoreFunc associated with the !POStore - * Make a clone of the JobContext passed in !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary here is because generally in the setStorelocation() method, the StoreFunc would communicate the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call. For example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates to the Configuration for different outputs to over-write each other - hence the clone. + * Make a clone of the !JobContext passed in !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary here is because generally in the setStorelocation() method, the !StoreFunc would communicate the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call. 
For example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates to the Configuration for different outputs to over-write each other - hence the clone. * Call getOutputFormat() on the !StoreFunc and then checkOutputSpecs() on the !OutputFormat returned. Note that the above setStoreLocation call needs to happen *before* the checkOutputSpecs() call and the checkOutputSpecs() call needs to be given the "updated (with location)" cloned JobContext. + + !PigOutputFormat.getOutputCommitter() returns a !PigOutputCommitter object. The !PigOutputCommitter internally keeps a list of OutputCommitters corresponding to !OutputFormat of !StoreFunc(s) in the POStore(s) in the map and reduce phases. It delegates all calls in the OutputCommitter class invoked by Hadoop to calls on the appropriate underlying committers. + + The other method in !OutputFormat is the getRecordWriter() method. In the single store case !PigOutputFormat.getRecordWriter() does the following: + * Instantiate the !StoreFunc associated with single !POStore. + * invoke !StoreFunc.setStoreLocation() + * Call getOutputFormat() on the !StoreFunc and then getRecordWriter() on the !OutputFormat returned. Note that the above setStoreLocation call needs to happen *before* the getRecordWriter() call and the getRecordWriter() call needs to be given a !TaskAttemptContext which has the "updated (with location)" Configuration. + * Wrap the !RecordWriter returned above in !PigRecordWriter class which is returned to Hadoop as the !RecordWriter. !PigRecordReader has WritableComparable as key type (which is always sent with a null value when we write, since in pig, we really do not have a key to store in the output( and a Tuple as a the value type (which is the output tuple). + + For the multi query optimized multi store case, there are multiple !POStores in the same map reduce job. 
In this case, the data is written out in the Pig map or reduce pipeline itself through the POStore operator. Details of this can be found in http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - "Internal Changes" section - "Store Operator" subsection. So from the pig runtime code, we never call Context.write() (which would have internally called PigRecordWriter.write()). So the handling of multi stores has not changed for writing data out for this redesign. === Remaining Tasks ===
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=19&rev2=20 -- == Implementation details and status == === Current status === - A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for PigStorage and BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. + A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. === Notes on implementation details === This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. Changes to work with Hadoop InputFormat model - Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. 
+ Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In PigInputFormat.getSplits(), the implementation processes each input in the following manner: + + * Instantiate the LoadFunc associated with the input + * Make a clone of the Configuration passed in the getSplits() call and then invoke LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. + * Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the getSplits() call and the getSplits() call needs to be given a !JobContext built out of the "updated (with location)" cloned Configuration. + * Wrap each returned !InputSplit in !PigSplit to store information like the list of target operators (the pipeline) for this input, the index of the split in the List of Splits returned by getSplits (this is used during merge join index creation) etc (comments in PigSplit explain the members) + + The list of target operators helps pig give the tuples from an input to the correct part of the pipeline in a multi input pipeline (like in join, cogroup, union). + + The other method in !InputFormat is createRecordReader which needs be given a !TaskAttemptContext. 
The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above LoadFunc.setLocation() call. However the !PigInputFormat.getSplits() method is called in the front-end by Hadoop and !PigInputFormat.createRecordReader() is called in the back-end. So we would need to somehow pass a Map between input and the input specific Configuration (updated with location and other information from the relevant LoadFunc.setLocation() call) from the front end to the back-end. One way to pass this map would be in the Configuration of the !JobContext passed to !PigInputFormat.getSplits(). However in Hadoop 0.20.1 this Configuration present in the !JobContext passed to !PigInputFormat.getSplits() is a copy of the Configuration which is serialized to the backend and used to create the !TaskAttemptContext passed in !PigInputFormat.createRecordReader(). Hence passing the map this way is not p
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=20&rev2=21 -- This section is to document changes made at a high level to give an overall connected picture which code comments may not provide. Changes to work with Hadoop InputFormat model - Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In PigInputFormat.getSplits(), the implementation processes each input in the following manner: + Hadoop has the notion of a single InputFormat per job. This is restrictive since Pig processes multiple inputs in the same map reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat which is the !InputFormat Pig communicates to Hadoop as the Job's !InputFormat. In !PigInputFormat.getSplits(), the implementation processes each input in the following manner: - * Instantiate the LoadFunc associated with the input + * Instantiate the !LoadFunc associated with the input - * Make a clone of the Configuration passed in the getSplits() call and then invoke LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. 
+ * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setlocation() using the clone. The reason a clone is necessary here is because generally in the setlocation() method, the loadfunc would communicate the location to its underlying !InputFormat. Typically !InputFormats store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputLocation(Job job, String location). We don't updates to the Configuration for different inputs to over-write each other - hence the clone. * Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation call needs to happen *before* the getSplits() call and the getSplits() call needs to be given a !JobContext built out of the "updated (with location)" cloned Configuration. * Wrap each returned !InputSplit in !PigSplit to store information like the list of target operators (the pipeline) for this input, the index of the split in the List of Splits returned by getSplits (this is used during merge join index creation) etc (comments in PigSplit explain the members) The list of target operators helps pig give the tuples from an input to the correct part of the pipeline in a multi input pipeline (like in join, cogroup, union). - The other method in !InputFormat is createRecordReader which needs be given a !TaskAttemptContext. The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above LoadFunc.setLocation() call. However the !PigInputFormat.getSplits() method is called in the front-end by Hadoop and !PigInputFormat.createRecordReader() is called in the back-end. So we would need to somehow pass a Map between input and the input specific Configuration (updated with location and other information from the relevant LoadFunc.setLocation() call) from the front end to the back-end. 
One way to pass this map would be in the Configuration of the !JobContext passed to !PigInputFormat.getSplits(). However in Hadoop 0.20.1 this Configuration present in the !JobContext passed to !PigInputFormat.getSplits() is a copy of the Configuration which is serialized to the backend and used to create the !TaskAttemptContext passed in !PigInputFormat.createRecordReader(). Hence passing the map this way is not possible. Hence we re-create the side effects of the !LoadFunc.setLocation() call in !PigInputFormat.getSplits() in !PigInputFormat.createRecordReader() by the following sequence: + The other method in !InputFormat is createRecordReader which needs be given a !TaskAttemptContext. The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above !LoadFunc.setLocation() call. However the !PigInputFormat.getSplits() method is cal
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=23&rev2=24 -- == Implementation details and status == === Current status === + https://issues.apache.org/jira/browse/PIG-966 is the main JIRA to track progress. A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. + - A branch -'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign) has been created to undertake work on this proposal. As of today (Nov 2. 2009) this branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. + Status on Nov 2. 2009: This branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi store queries with multi query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2. 2009 in the Changes below) have not been incorporated. A list (may not be comprehensive) of remaining tasks is listed in a subsection below. === Notes on implementation details === This section is to document changes made at a high level to give an overall connected picture which code comments may not provide.
[Pig Wiki] Trivial Update of "PigMix" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by DmitriyRyaboy. The comment on this change is: added back-dated weighted averages. http://wiki.apache.org/pig/PigMix?action=diff&rev1=12&rev2=13 -- || L12 multi-store || 150 || fails|| 781 || 499 || 804 || || Total time || 1791 || 13638|| 4420 || 3284 || 2950|| || Compared to hadoop || 1.0 || 7.6 || 2.5 || 1.8 || 1.6 || + || Weighted Average || 1.0 || 11.2 || 3.26 || 2.20 || 1.97|| The totb run of 1/20/09 includes the change to make !BufferedPositionedInputStream use a buffer instead of relying on hadoop to buffer. @@ -60, +61 @@ || L12 multi-store || 139 || 159 || || Total time || 1826 || 2764|| || Compared to hadoop || N/A || 1.5 || - + || Weighted average || N/A || 1.83|| Run date: June 28, 2009, run against top of trunk as of that day. Note that the columns got reversed in this one (Pig then MR) - || Test || Pig run time || Java run time || Multiplier || + || Test || Pig run time || Java run time || Multiplier || || PigMix_1 || 204 || 117.33 || 1.74 || || PigMix_2 || 110.33 || 50.67 || 2.18 || || PigMix_3 || 292.33 || 125 || 2.34 || @@ -79, +80 @@ || PigMix_11 || 206.33 || 136.67 || 1.51 || || PigMix_12 || 173 || 161.67 || 1.07 || || Total || 2729.67 || 1948.33 || 1.40 || + || Weighted avg || || || 1.68 || Run date: August 27, 2009, run against top of trunk as of that day. @@ -96, +98 @@ || PigMix_11 || 180 || 121 || 1.49 || || PigMix_12 || 156 || 160.67 || 0.97 || || Total || 2440.67 || 2001.67 || 1.22 || + || Weighted avg || || || 1.53 || Run date: October 18, 2009, run against top of trunk as of that day. With this run we included a new measure, weighted average. 
The multiplier we have been publishing takes the total time of running all 12 Pig Latin scripts and compares it to the total time of @@ -118, +121 @@ || PigMix_11 || 145.33 || 168.67|| 0.86 || || PigMix_12 || 55.33|| 95.33 || 0.58 || || Total || 1352.33 || 1357 || 1.00 || - Weighted Average: 1.04 + || Weighted avg || || || 1.04 || == Features Tested ==
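The difference between the two published numbers can be sketched numerically. The exact PigMix weighting formula is not given in this excerpt, so this is only a hedged illustration: it assumes the "weighted average" gives each script's multiplier equal weight, while the older published multiplier is a ratio of totals that long-running scripts dominate. The runtimes below are made up, not actual PigMix measurements.

```java
// Hypothetical runtimes in seconds; NOT actual PigMix numbers.
public class MultiplierSketch {

    // The older published multiplier: ratio of total Pig time to total
    // hand-written MapReduce time. Long-running scripts dominate it.
    public static double totalsRatio(double[] pig, double[] mr) {
        double p = 0, m = 0;
        for (int i = 0; i < pig.length; i++) { p += pig[i]; m += mr[i]; }
        return p / m;
    }

    // A per-script average: every script's multiplier counts equally,
    // regardless of how long the script runs.
    public static double perScriptMean(double[] pig, double[] mr) {
        double sum = 0;
        for (int i = 0; i < pig.length; i++) sum += pig[i] / mr[i];
        return sum / pig.length;
    }

    public static void main(String[] args) {
        double[] pig = {400, 50};   // Pig runtimes for two scripts
        double[] mr  = {200, 100};  // hand-written MapReduce runtimes
        System.out.println(totalsRatio(pig, mr));   // 1.5: dominated by script 1
        System.out.println(perScriptMean(pig, mr)); // 1.25: script 2's win counts
    }
}
```

With the same inputs the two metrics disagree (1.5 vs 1.25), which is why a second number was worth publishing.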
[Pig Wiki] Update of "PigTalksPapers" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTalksPapers" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/PigTalksPapers?action=diff&rev1=7&rev2=8 -- * Pig: Making Hadoop Easy, talk at !ApacheCon US 2008: [[http://wiki.apache.org/pig/ApacheConUS2008|ApacheConUS2008]] * Pig: Making Hadoop Easy, talk at !ApacheCon EU 2009: [[attachment:ApacheConEurope09.ppt|ApacheConEU2009]] * Pig talk given at 2009 Hadoop Summit [[attachment:HadoopSummit2009.ppt|HadoopSummit2009]] + + * Pig usage at Twitter, a presentation from NoSQL East [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|slides]] + * Pig talk for Pittsburgh HUG: intro, explanation of joins, research ideas [[http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/|slides]] == Pig Papers == * Pig paper at VLDB 2009: [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|pdf]]
New attachment added to page PigTalksPapers on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigTalksPapers" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: apacheconus2009.pptx Attachment size: 337661 Attachment link: http://wiki.apache.org/pig/PigTalksPapers?action=AttachFile&do=get&target=apacheconus2009.pptx Page link: http://wiki.apache.org/pig/PigTalksPapers
New attachment added to page PigTalksPapers on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigTalksPapers" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: vldb_presentation.pptx Attachment size: 351814 Attachment link: http://wiki.apache.org/pig/PigTalksPapers?action=AttachFile&do=get&target=vldb_presentation.pptx Page link: http://wiki.apache.org/pig/PigTalksPapers
[Pig Wiki] Update of "PigTalksPapers" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTalksPapers" page has been changed by AlanGates. http://wiki.apache.org/pig/PigTalksPapers?action=diff&rev1=8&rev2=9 -- * Pig: Making Hadoop Easy, talk at !ApacheCon US 2008: [[http://wiki.apache.org/pig/ApacheConUS2008|ApacheConUS2008]] * Pig: Making Hadoop Easy, talk at !ApacheCon EU 2009: [[attachment:ApacheConEurope09.ppt|ApacheConEU2009]] * Pig talk given at 2009 Hadoop Summit [[attachment:HadoopSummit2009.ppt|HadoopSummit2009]] - * Pig usage at Twitter, a presentation from NoSQL East [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|slides]] * Pig talk for Pittsburgh HUG: intro, explanation of joins, research ideas [[http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/|slides]] + * Pig talk at !ApacheCon US 2009: [[attachment:apacheconus2009.pptx|slides]] == Pig Papers == - * Pig paper at VLDB 2009: [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|pdf]] + * Pig paper at VLDB 2009: [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|pdf]], [[attachment:vldb_presentation.pptx|slides]] from the associated talk. * Pig Latin paper at SIGMOD 2008: [[http://infolab.stanford.edu/~olston/publications/sigmod08.pdf|pdf]] * Pig optimization paper at USENIX 2008: [[http://infolab.stanford.edu/~olston/publications/usenix08.pdf|pdf]]
[Pig Wiki] Update of "PiggyBank" by FlipKromer
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PiggyBank" page has been changed by FlipKromer. The comment on this change is: More detail on the CLASSPATH -- need to have hadoop and commons-logging jars in there too. http://wiki.apache.org/pig/PiggyBank?action=diff&rev1=13&rev2=14 -- = Piggy Bank - User Defined Pig Functions = - This is a place for Pig users to share their functions. The functions are contributed "as-is". If you find a bug or if you feel a function is missing, take the time to fix it or write it yourself and contribute the changes. <> + == Using Functions == - To see how to use your own functions in a pig script, please see the [[http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm|Pig Latin Reference Manual]]. Note that only JAVA functions are supported at this time. The functions are currently distributed in source form. Users are required to check out the code and build the package themselves. No binary distributions or nightly builds are available at this time. @@ -14, +13 @@ To build a jar file that contains all available user defined functions (UDFs), please follow these steps: 1. Checkout UDF code: `svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank` - 2. Add pig.jar to your ClassPath : `export CLASSPATH=$CLASSPATH:/path/to/pig.jar` + 1. Add pig.jar to your ClassPath : `export CLASSPATH=$CLASSPATH:/path/to/pig.jar` - 3. Build the jar file: from `trunk/contrib/piggybank/java` directory run `ant`. This will generate `piggybank.jar` in the same directory. + 1. Build the jar file: from the `trunk/contrib/piggybank/java` directory run `ant`. This will generate `piggybank.jar` in the same directory. + Make sure your classpath includes the hadoop jars as well.
This worked for me using the Cloudera CDH2 / Hadoop AMIs: + {{{ + pig_version=0.4.99.0+10 ; pig_dir=/usr/lib/pig ; + hadoop_version=0.20.1+152 ; hadoop_dir=/usr/lib/hadoop ; + export CLASSPATH=$CLASSPATH:${hadoop_dir}/hadoop-${hadoop_version}-core.jar:${hadoop_dir}/hadoop-${hadoop_version}-tools.jar:${hadoop_dir}/hadoop-${hadoop_version}-ant.jar:${hadoop_dir}/lib/commons-logging-1.0.4.jar:${pig_dir}/pig-${pig_version}-core.jar + }}} To obtain a `javadoc` description of the functions run `ant javadoc` from the `trunk/contrib/piggybank/java` directory. The documentation is generated in the `trunk/contrib/piggybank/java/build/javadoc` directory. - + To use a function, you need to figure out which package it belongs to. The top level packages correspond to the function type and currently are: - * org.apache.pig.piggybank.comparison - for custom comparator used by ORDER operator + * org.apache.pig.piggybank.comparison - for custom comparator used by ORDER operator - * org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations + * org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations - * org.apache.pig.piggybank.filtering - for functions used in FILTER operator + * org.apache.pig.piggybank.filtering - for functions used in FILTER operator - * org.apache.pig.piggybank.grouping - for grouping functions + * org.apache.pig.piggybank.grouping - for grouping functions - * org.apache.pig.piggybank.storage - for load/store functions + * org.apache.pig.piggybank.storage - for load/store functions (The exact package of the function can be seen in the javadocs or by navigating the source tree.)
@@ -37, +42 @@ TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ; STORE TweetsInaug INTO 'meta/inaug/tweets_inaug' ; }}} - - == Contributing Functions == - For details on how to create UDFs, please, see the [[http://wiki.apache.org/pig/UDFManual|UDF Manual]]. Note that only JAVA functions are supported at this time. To contribute a new function, please, follow the steps: 1. Check existing javadoc to make sure that the function does not already exist as described in [[#Using_Functions]] - 2. Checkout UDF code as described in [[#Using_Functions]] + 1. Checkout UDF code as described in [[#Using_Functions]] - 3. Place your java code in the directory that makes sense for your function. The directory structure as of now has two levels: function type as described in [[#Using_Functions]] and function subtype (like math or string for eval functions) for some of the types. If you feel that your function requires a new subtype, feel free to add one. + 1. Place your java code in the directory that makes sense for your function. The directory structure as of now has two levels: function type as described in [[#Using_Functions]] and function subtype (like math or string for eval functions) for some of the types. If you feel t
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=131&rev2=132 -- ||2199||LOLoad: schema mismatch|| ||2200||PruneColumns: Error getting top level project|| ||2201||Could not validate schema alias|| + ||2202||Error change distinct/sort to use secondary key optimizer|| + ||2203||Sort on columns from different inputs|| + ||2204||Error setting secondary key plan|| + ||2205||Error visiting POForEach inner plan|| + ||2206||Error visiting POSort inner plan|| + ||2207||POForEach inner plan has more than 1 root|| + ||2208||Exception visiting foreach inner plan|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Trivial Update of "LoadStoreRedesignProposal" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=28&rev2=29 -- // Probably more in here } - public long mBytes; // size in megabytes + public long mBytes; // "disk" size in megabytes (file size or equivalent) public long numRecords; // number of records public ResourceFieldStatistics[] fields; @@ -608, +608 @@ Added a new section 'Implementation details and status' + Nov 11, Dmitriy Ryaboy + Minor clarification of meaning of mBytes in !ResourceStatistics +
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=29&rev2=30 -- /** - * Communicate to the loader the load string used in Pig Latin to refer to the - * object(s) being loaded. The location string passed to the LoadFunc here + * Communicate to the loader the location of the object(s) being loaded. + * The location string passed to the LoadFunc here is the return value of - * is the return value of {...@link LoadFunc#relativeToAbsolutePath(String, String)} + * {...@link LoadFunc#relativeToAbsolutePath(String, String)} * * This method will be called in the backend multiple times. Implementations * should bear in mind that this method is called multiple times and should
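The revised javadoc above stresses that setLocation() is called multiple times on the backend, so implementations should tolerate repeated calls. A minimal standalone sketch of that pattern follows; it does not extend Pig's real LoadFunc, and every name other than setLocation() (the field names, the setup counter) is made up for illustration.

```java
// Sketch of an idempotent setLocation(): repeated backend calls with
// the same location must not redo expensive one-time setup work.
public class SketchLoader {
    private String location;     // absolute location, as produced by
                                 // relativeToAbsolutePath() on the front end
    private boolean configured;  // guards the one-time setup
    public int setupCount;       // exposed only so the example is checkable

    public void setLocation(String loc) {
        if (configured && loc.equals(location)) {
            return;              // repeated backend call: nothing to redo
        }
        location = loc;
        configured = true;
        setupCount++;            // stand-in for expensive setup work
    }

    public static void main(String[] args) {
        SketchLoader l = new SketchLoader();
        l.setLocation("hdfs://nn/data/in");
        l.setLocation("hdfs://nn/data/in"); // called again on the backend
        System.out.println(l.setupCount);   // 1: setup ran only once
    }
}
```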
[Pig Wiki] Update of "LoadStoreRedesignProposal" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by PradeepKamath. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=30&rev2=31 -- } public ResourceFieldSchema[] fields; - public Map byName; enum Order { ASCENDING, DESCENDING } public int[] sortKeys; // each entry is an offset into the fields array.
[Pig Wiki] Update of "LoadStoreRedesignProposal" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by ThejasNair. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=31&rev2=32 -- Mechanism to read side files Pig needs to read side files in many places like in Merge Join, Order by, Skew join, dump etc. To facilitate doing this in an easy manner, a utility !LoadFunc called !ReadToEndLoader has been introduced. Though this has been implemented as a !LoadFunc, the only !LoadFunc method which is truly implemented is getNext(). The usage pattern is to construct an instance using the constructor which takes a reference to the true !LoadFunc (which can read the side file data) and then repeatedly call getNext() till null is encountered in the return value. The implementation of !ReadToEndLoader hides the actions of getting !InputSplits from the underlying !InputFormat and then processing each split by getting the !RecordReader and processing data in the split before moving to the next. + + Changes to skew join sampling (PoissonSampleLoader) + See discussion in [[https://issues.apache.org/jira/browse/PIG-1062|PIG-1062]]. + + '''Problem 1''': + The earlier version of !PoissonSampleLoader stored the size on disk as an extra last column in the sampled tuples it returned in the map phase of the sampling MR job. This was used in the !PartitionSkewedKeys udf in the reduce stage of the sampling job to compute the total number of tuples using input-file-size/avg-disk-sz-from-samples. Avg-disk-sz-from-samples is not available with the new loader design, because getPosition() is no longer there. + + '''Solution:''' + !PoissonSampleLoader returns a special tuple with the number of rows in that map, in addition to the sampled tuples. To create this special tuple, the max row length in the input sampled tuples is tracked, and a new tuple with size max_row_length + 2 is created.
+ And spl_tuple[max_row_length] = "marker_string" + spl_tuple[max_row_length + 1] = num_rows + The size of max_row_length+2 is used because the join key can be an expression, which is evaluated on the columns in tuples returned by the sampler, and the expression might expect specific data types to be present in certain (<= max_row_length) locations of the tuple. + If the number of tuples in the sample is 0, the special tuple is not sent. + + In the !PartitionSkewedKeys udf in the reduce stage, the udf iterates over the tuples to find these special tuples and calculates the total number of rows. + + + '''Problem 2''': + !PoissonSampleLoader samples 17 tuples from every set of tuples that will fit into reducer memory (see PigSkewedJoinSpec). Let us call this number of tuples that fit into reducer memory X, i.e. we need to sample one tuple every X/17 tuples. + Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..). To get the number of samples to be taken in a map, the formula used was number-of-reducer-memories-needed * 17 / number-of-splits + Where: + number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size + disk_to_mem_factor has a default of 2. + + Then !PoissonSampleLoader would return sampled tuples by skipping split-size/num_samples bytes at a time. + + With the new loader we have to skip some number of tuples instead of bytes. But we don't have an estimate of the total number of tuples in the input. + One way to work around this would be to use the size of a tuple in memory to estimate its size on disk using the above disk_to_mem_factor; the number of tuples to be skipped would then be (split-size/avg_mem_size_of_tuple)/numSamples + + But the use of disk_to_mem_factor is very dubious: the real disk_to_mem_factor will vary based on the compression algorithm, data characteristics (sorting etc.), and encoding.
+ + '''Solution''': + The goal is to sample one tuple every X/17 tuples (X = number of tuples that fit in available reducer memory). + To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size. + The number of tuples skipped for every sampled tuple = 1/17 * (available_reducer_heap_size/average-tuple-mem-size) + + The average-tuple-mem-size and number-of-tuples-to-be-skipped-per-sampled-tuple are recalculated after a new tuple is sampled. + + Changes to order-by sampling (RandomSampler) + + '''Problem''': With the new interface, we cannot use the old approach of dividing the file size by the number of samples required and skipping that many bytes to get each new sample. + + '''Proposal''': + In getNext(), allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, generate a random number r s.t. 0<=r
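The skip-interval arithmetic in the solution above can be sketched as follows. This is a standalone illustration, not the actual !PoissonSampleLoader code: only the 17-samples-per-reducer-memory constant comes from PigSkewedJoinSpec, and the class, method names, and driver numbers are made up.

```java
// Recalculate, after each sampled tuple, how many tuples to skip before
// taking the next sample: skip = (heap / avg-tuple-mem-size) / 17.
public class SkipIntervalSketch {
    static final int SAMPLES_PER_REDUCER_MEMORY = 17;

    private long totalMemSize = 0; // in-memory size of all sampled tuples
    private long numSampled = 0;

    // Record one sampled tuple and return how many tuples to skip
    // before the next sample is taken.
    public long recordSample(long tupleMemSize, long reducerHeapBytes) {
        totalMemSize += tupleMemSize;
        numSampled++;
        double avgTupleMemSize = (double) totalMemSize / numSampled;
        // X = tuples that fit in reducer memory; sample one every X/17.
        double x = reducerHeapBytes / avgTupleMemSize;
        return (long) (x / SAMPLES_PER_REDUCER_MEMORY);
    }

    public static void main(String[] args) {
        SkipIntervalSketch s = new SkipIntervalSketch();
        System.out.println(s.recordSample(10, 1700)); // avg 10 -> X=170 -> skip 10
        System.out.println(s.recordSample(30, 1700)); // avg 20 -> X=85  -> skip 5
    }
}
```

Because the running average is updated on every sample, the skip interval adapts as bigger or smaller tuples are seen, with no disk_to_mem_factor involved.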
[Pig Wiki] Update of "PigAccumulatorSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAccumulatorSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigAccumulatorSpec -- New page: = Accumulator UDF = == Introduction == For data processing with PIG, it is very common to call "group by" or "cogroup" to group input tuples by a key, then call one or more UDFs to process each group. For example: {{{ A = load 'mydata'; B = group A by $0; C = foreach B generate group, myUDF1(A), myUDF2(A, 'some_param'), myUDF3(A); store C into 'myresult'; }}} In the current implementation, during the grouping process all tuples that belong to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problems. For a large key, if its tuples cannot fit into memory, performance suffers because extra data has to be spilled to disk. Since many UDFs do not really need to see all the tuples that belong to a key at the same time, it is possible to pass those tuples in batches. Good examples are COUNT() and SUM(). Tuples can be passed to UDFs in an accumulative manner. When all the tuples have been passed, the final method is called to retrieve the value. This way, we can minimize memory usage and improve performance by avoiding data spills. == UDF change == An Accumulator interface is defined. UDFs that are able to process tuples in an accumulative manner should implement this interface. It is defined as follows: {{{ public interface Accumulator { /** * Pass tuples to the UDF. You can retrieve the DataBag by calling b.get(index). * Each DataBag may contain 0 to many tuples for the current key */ public void accumulate(Tuple b) throws IOException; /** * Called when all tuples from the current key have been passed to accumulate. * @return the value for the UDF for this key. */ public T getValue(); /** * Called after getValue() to prepare processing for the next key.
*/ public void cleanup(); } }}} A UDF should still extend EvalFunc as before. The PIG engine detects based on context whether tuples can be processed accumulatively. If not, the regular EvalFunc is called. Therefore, for a UDF, both interfaces should be implemented properly. == Use Cases == The PIG engine processes tuples accumulatively only when all of the UDFs implement the Accumulator interface. If one of the UDFs is not an Accumulator, then all UDFs are called through their EvalFunc interface as regular UDFs. Following are examples where the accumulator interface of UDFs would be called: * group by {{{ A = load 'mydata'; B = group A by $0; C = foreach B generate group, myUDF(A); store C into 'myresult'; }}} * cogroup {{{ A = load 'mydata1'; B = load 'mydata2'; C = cogroup A by $0, B by $0; D = foreach C generate group, myUDF(A), myUDF(B); store D into 'myresult'; }}} * group by with sort {{{ A = load 'mydata'; B = group A by $0; C = foreach B { D = order A by $1; generate group, myUDF(D); } store C into 'myresult'; }}} * group by with distinct {{{ A = load 'mydata'; B = group A by $0; C = foreach B { D = A.$1; E = distinct D; generate group, myUDF(E); } store C into 'myresult'; }}} == When to Call Accumulator == The MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if a POSort or PODistinct in the inner plan of a foreach can be removed/replaced by using the secondary sort key supported by hadoop. If it is a POSort, it is removed. If it is a PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases, with order by and distinct inside the foreach inner plan, can still run in accumulative mode. The AccumulatorOptimizer checks the reducer plan and enables the accumulator if the following criteria are met: * The reducer plan uses POPackage as root, not any of its sub-classes.
POPackage is not for distinct, and none of its inputs is set as inner. * The successor of POPackage is a POForeach. * Each leaf of a POForEach input plan is an ExpressionOperator, and it must be one of the following: * ConstantExpression * POProject, whose result type is not BAG, or TUPLE and overloaded * POMapLookup * POCase * UnaryExpressionOperator * BinaryExpressionOperator * POBinCond * POUserFunc that implements the Accumulator interface and whose inputs contain only ExpressionOperation, POForEach, or POSortedDistinct, but not another POUserFunc. Therefore, if under POForEach, there ar
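For illustration, here is a minimal COUNT-style implementation of the Accumulator interface defined in the spec above. To keep the sketch self-contained the interface is re-declared locally and a plain List<Object> stands in for Pig's Tuple/DataBag types, so this is not a real Pig UDF (a real one would also extend EvalFunc, as the spec requires).

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Local re-declaration of the spec's interface; List<Object> stands in
// for Pig's Tuple in this standalone sketch.
interface Accumulator<T> {
    // Pass one batch of tuples for the current key.
    void accumulate(List<Object> batch) throws IOException;
    // Called once all tuples for the current key have been passed in.
    T getValue();
    // Called after getValue() to reset state for the next key.
    void cleanup();
}

// COUNT in accumulative form: only a running counter is kept, never
// the whole bag, so nothing has to spill to disk.
public class CountAccumulator implements Accumulator<Long> {
    private long count = 0;

    @Override public void accumulate(List<Object> batch) {
        count += batch.size();
    }
    @Override public Long getValue() { return count; }
    @Override public void cleanup() { count = 0; }

    public static void main(String[] args) {
        CountAccumulator c = new CountAccumulator();
        c.accumulate(Arrays.asList((Object) "a", "b")); // first batch
        c.accumulate(Arrays.asList((Object) "c"));      // second batch
        System.out.println(c.getValue()); // 3
        c.cleanup();                      // ready for the next key
    }
}
```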
[Pig Wiki] Update of "PigAccumulatorSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAccumulatorSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigAccumulatorSpec?action=diff&rev1=1&rev2=2 -- = Accumulator UDF = - == Introduction == For data processing with PIG, it is very common to call "group by" or "cogroup" to group input tuples by a key, then call one or more UDFs to process each group. For example: @@ -11, +10 @@ C = foreach B generate group, myUDF1(A), myUDF2(A, 'some_param'), myUDF3(A); store C into 'myresult'; }}} - - The current implementation is during grouping process, all tuples that belongs to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problem. For a large key, if its tuples can not fit into memory, performance has to sacrifice to spill extra data into disk. + The current implementation is during grouping process, all tuples that belongs to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problem. For a large key, if its tuples can not fit into memory, performance has to sacrifice to spill extra data into disk. Since many UDFs do not really need to see all the tuples that belongs to a key at the same time, it is possible to pass those tuples as batches. A good example would be like COUNT(), SUM(). Tuples can be passed to UDFs in accumulative manner. When all the tuples are passed, the final method is called to retrieve the value. This way, we can minimize the memory usage and improve performance by avoiding data spill. @@ -22, +20 @@ {{{ public interface Accumulator { /** - * Pass tuples to the UDF. You can retrive DataBag by calling b.get(index). + * Pass tuples to the UDF. You can retrive DataBag by calling b.get(index). 
* Each DataBag may contain 0 to many tuples for current key */ public void accumulate(Tuple b) throws IOException; @@ -32, +30 @@ * @return the value for the UDF for this key. */ public T getValue(); - + - /** + /** - * Called after getValue() to prepare processing for next key. + * Called after getValue() to prepare processing for next key. */ public void cleanup(); } }}} - UDF should still extend EvalFunc as before. The PIG engine would detect based on context whether tuples can be processed accumulatively. If not, then regular EvalFunc would be called. Therefore, for a UDF, both interfaces should be implemented properly == Use Cases == PIG engine would process tuples accumulatively only when all of the UDFs implements Accumulator interface. If one of the UDF is not Accumulator, then all UDFs are called by their EvalFunc interface as regular UDFs. Following are examples accumulator interface of UDFs would be called: -* group by + * group by - {{{ + . {{{ A = load 'mydata'; B = group A by $0; C = foreach B generate group, myUDF(A); store C into 'myresult'; - }}} + }}} -* cogroup + * cogroup - {{{ + . {{{ A = load 'mydata1'; B = load 'mydata2'; C = cogroup A by $0, B by $0; D = foreach C generate group, myUDF(A), myUDF(B); store D into 'myresult'; - }}} + }}} -* group by with sort + * group by with sort - {{{ + . {{{ A = load 'mydata'; B = group A by $0; C = foreach B { @@ -72, +69 @@ generate group, myUDF(D); } store C into 'myresult'; - }}} + }}} -* group by with distinct + * group by with distinct - {{{ + . {{{ A = load 'mydata'; B = group A by $0; C = foreach B { @@ -84, +81 @@ generate group, myUDF(E); } store C into 'myresult'; - }}} + }}} == When to Call Accumulator == - MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. 
This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. + . MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key
New attachment added to page PigAccumulatorSpec/homes/yinghe/Desktop on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigAccumulatorSpec/homes/yinghe/Desktop" for change notification. An attachment has been added to that page by yinghe. Following detailed information is available: Attachment name: SequenceDiagram.jpg Attachment size: 51846 Attachment link: http://wiki.apache.org/pig/PigAccumulatorSpec/homes/yinghe/Desktop?action=AttachFile&do=get&target=SequenceDiagram.jpg Page link: http://wiki.apache.org/pig/PigAccumulatorSpec/homes/yinghe/Desktop
[Pig Wiki] Update of "LoadStoreRedesignProposal" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by ThejasNair. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=32&rev2=33 -- '''Problem 2''': !PoissonSampleLoader samples 17 tuples from every set of tuples that will fit into reducer memory (see PigSkewedJoinSpec) . Let us call this number of tuples that fit into reducer memory - X. Ie we need to sample one tuple every X/17 tuples. - Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..) . To get the number of samples to be sampled in a map, the formula used was = number-of-reducer-memories-needed * 17 / number-of-splits + Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..) . To get the number of samples to be sampled in a map, the formula used was = number-of-reducer-memories-needed * 17 / number-of-splits <> Where - - number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size + number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size<> disk_to_mem_factor has default of 2. Then !PoissonSampleLoader would return sampled tuples by skipping split-size/num_samples bytes at a time. - With new loader we have to skip some number of tuples instead of bytes. But we don't have an estimate of total number of tuples in the input. + With new loader we have to skip some number of tuples instead of bytes. 
But we don't have an estimate of total number of tuples in the input.<> One way to work around this would be to use size of tuple in memory to estimate size of tuple in disk using above disk_to_mem_factor, then number of tuples to be skipped will be = (split-size/avg_mem_size_of_tuple)/numSamples But the use of disk_to_mem_factor is very dubious, the real disk_to_mem_factor will vary based on compression-algorithm, data characteristics (sorting etc), and encoding. '''Solution''': - The goal is to sample one tuple every X/17 tuples. (X = number of tuples that fit in available reducer memory) + The goal is to sample one tuple every X/17 tuples. (X = number of tuples that fit in available reducer memory).<> - To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size + To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size.<> Number of tuples skipped for every sampled tuple = 1/17 * ( available_reducer_heap_size/average-tuple-mem-size) The average-tuple-mem-size and number-of-tuples-to-be-skippled-every-sampled-tuple is recalculated after a new tuple is sampled.
[Pig Wiki] Update of "PigAccumulatorSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAccumulatorSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigAccumulatorSpec?action=diff&rev1=2&rev2=3 -- }}} == When to Call Accumulator == - . MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. + . MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. The AccumulatorOptimizer checks the reducer plan and enables accumulator if following criteria are met: - The AccumulatorOptimizer checks the reducer plan and enables accumulator if following criteria are met: * The reducer plan uses POPackage as root, not any of its sub-classes. POPackage is not for distinct, and any of its input is not set as inner. * The successor of POPackage is a POForeach. 
* The leaves of each POForEach input plan is an ExpressionOperator and it must be one of the following: @@ -109, +108 @@ {{attachment:/homes/yinghe/Desktop/SequenceDiagram.jpg}} + == Internal Changes == + === Accumulator === + . A new interface that a UDF can implement if it can run in accumulative mode. + + === PhysicalOperator === + . Add new methods setAccumulative(), setAccumStart(), setAccumEnd() to flag a physical operator to run in accumulative mode, and to mark the start and end of accumulation. This change is in the patch for PIG-1038. + + === MapReduceLauncher === + . Create AccumulatorOptimizer and use it to visit the plan. + + === AccumulatorOptimizer === + . Another MROpPlanVisitor. It checks the reduce plan and, if it meets all the criteria, sets the "accumulative" flag on POPackage and POForEach. It is created and invoked by MapReduceLauncher. + + === POStatus === + . Add a new state "STATUS_BATCH_OK" to indicate a batch is processed successfully in accumulative mode. + + === POForEach === + . If its "accumulative" flag is set, the bags passed to it through a tuple are AccumulativeBags as opposed to regular tuple bags. It gets the AccumulativeTupleBuffer from the bag. Then it runs a while loop, calling nextBatch() of AccumulativeTupleBuffer and passing the input to the inner plans. If an inner plan contains any UDF, the inner plan returns POStatus.STATUS_BATCH_OK if the current batch is processed successfully. When there are no more batches to process, POForEach notifies each inner plan that accumulation is done, makes a final call to get the result, and exits the while loop. At the end, POForEach returns the result to its successor in the reducer plan. The operators that call POForEach don't need to know whether POForEach gets its result through regular mode or accumulative mode. + + === AccumulativeBag === + . An implementation of DataBag used by POPackage for processing data in accumulative mode. This bag doesn't contain all tuples from the iterator.
Instead, it wraps an AccumulativeTupleBuffer, which contains an iterator to pull tuples out in batches. Calling iterator() on this class only gives you the tuples for the current batch. + + === AccumulativeTupleBuffer === + . An underlying buffer that is shared by all AccumulativeBags (one bag for group by, multiple bags for cogroup) generated by POPackage. POPackage has an inner class which implements this interface. POPackage creates an instance of this buffer and sets it into the AccumulativeBags. This buffer has methods to retrieve the next batch of tuples, which in turn call methods of POPackage to read tuples out of the iterator and put them in an internal list. The AccumulativeBag has access to that list to return an iterator of tuples. + + === POPackage === + . If its "accumulative" flag is set, it creates an AccumulativeBag and AccumulativeTupleBuffer as opposed to creating default tuple bags. It then sets the AccumulativeTupleBuffer into the AccumulativeBag, and sets the AccumulativeBag into the result tuple. + POPack
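The batch-driven contract described above can be sketched in plain Java. This is a hand-rolled illustration, not Pig's actual code: the interface below stands in for Pig's Accumulator (which operates on Tuples), and lists of longs stand in for bags of tuples so the sketch stays self-contained.

```java
import java.util.Arrays;
import java.util.List;

// Hand-rolled sketch of the accumulative-mode contract described above.
public class AccumulatorSketch {

    // Stand-in for Pig's Accumulator interface.
    interface Accumulator<T> {
        void accumulate(List<Long> batch); // called once per batch
        T getValue();                      // final call after the last batch
        void cleanup();                    // reset state for the next key
    }

    // A SUM-style UDF that never needs the whole bag in memory at once.
    static class Sum implements Accumulator<Long> {
        private long total = 0;
        public void accumulate(List<Long> batch) {
            for (long v : batch) total += v;
        }
        public Long getValue() { return total; }
        public void cleanup() { total = 0; }
    }

    // Mimics POForEach's while loop: accumulate each batch pulled from the
    // buffer, then make the final call to fetch the result.
    public static long process(List<List<Long>> batches) {
        Sum udf = new Sum();
        for (List<Long> batch : batches) {
            udf.accumulate(batch); // a batch processed OK (STATUS_BATCH_OK)
        }
        long result = udf.getValue(); // accumulation done: final call
        udf.cleanup();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList(
            Arrays.asList(1L, 2L, 3L),
            Arrays.asList(4L, 5L)))); // prints 15
    }
}
```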
[Pig Wiki] Trivial Update of "PigStreamingFunctionalSpec" by MarcioSilva
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigStreamingFunctionalSpec" page has been changed by MarcioSilva. The comment on this change is: correcting what appears to be a typo.. http://wiki.apache.org/pig/PigStreamingFunctionalSpec?action=diff&rev1=47&rev2=48 -- Streaming can have three separate meanings in the context of the Pig project: 1. A specific way of submitting jobs to Hadoop: Hadoop Streaming - 2. A form of processing in which the entire portion of the dataset that corresponds to a task in sent to the task and output streams out. There is no temporal or causal correspondence between an input record and specific output records. + 2. A form of processing in which the entire portion of the dataset that corresponds to a task is sent to the task and output streams out. There is no temporal or causal correspondence between an input record and specific output records. 3. The use of non-Java functions with Pig. The goal of Pig with respect to streaming is to support #2 for (a) Java UDFs, (b) non-Java UDFs, and (c) user-specified binaries/scripts. We will start with (c) since it would be most beneficial for the users. It is not our goal to be feature-by-feature compatible with Hadoop streaming as it is too open-ended and might force us to implement features that we don't necessarily want in Pig.
[Pig Wiki] Update of "PigSkewedJoinSpec" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by ThejasNair. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=12&rev2=13 -- In order to use skewed join, -* Skewed join currently works with tow-table inner join. +* Skewed join currently works with two-table inner join. * Append 'using "skewed"' construct to the join to force pig to use skewed join * pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance. The default value is =0.5=.
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=35&rev2=36 -- '''Proposal''': The goal is to sample tuples with equal probability for any tuple getting sampled (assuming the number of tuples to be sampled is much smaller than the total number of tuples). If N is the number of samples required, in getNext() allocate a buffer for N elements, populate it with the first N tuples, and continue scanning the partition. For every ith next() call, generate a random number r s.t. 0<=rhttp://hadoop.apache.org/common/docs/r0.20.1/streaming.html .) + + I propose that Pig move to a model of using Hadoop's default streaming format, which is to expect new-line separated records, with tab being used as a field separator. Hadoop allows users to + redefine the field separator, and so should Pig. This will also match the current default of using !PigStorage as the (de)serializer for streaming. As before, Pig should support communicating + with the executable via either stdin and stdout or files. This will force a syntax change in Pig Latin. Currently, if a user wants to stream data to an executable with comma-separated fields + instead of tab-separated fields, the syntax is: + + {{{ + define CMD `perl PigStreaming.pl - foo nameMap` input(stdin using PigStorage(',')) output(stdout using PigStorage(',')); + A = load 'file'; + B = stream A through CMD; + }}} + + The syntax should change to remove the references to store and load functions, as they are no longer meaningful.
Thus the above would become: + + {{{ + define CMD `perl PigStreaming.pl - foo nameMap` input(stdin using ',') output(stdout using ','); + A = load 'file'; + B = stream A through CMD; + }}} + + From an implementation viewpoint, the functionality required to write to and read from the streaming binary will be equivalent to the tuple parsing and serialization of !PigStorage.getNext() and + !PigStorage.putNext(). While it will not be possible to use PigStorage directly, every effort should be made to share this code (most likely by putting the actual code in static + utility methods that can be called by each class) to avoid double code maintenance costs. === Remaining Tasks === @@ -665, +696 @@ * Changes to order-by sampling (!RandomSampler) * Changes to skew join sampling (!PoissonSampleLoader) + Nov 23 2009, Gates + * Added section "Changes to Streaming" +
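A minimal sketch of the record handling this default implies: newline-separated records with a configurable field delimiter (tab by default), roughly what !PigStorage's parsing and serialization do. Class and method names here are illustrative, not Pig's.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch (not Pig's classes): newline-separated records with
// a configurable field delimiter, tab by default, as in Hadoop streaming.
public class StreamFormat {
    private final String fieldDelim;

    public StreamFormat(String fieldDelim) { this.fieldDelim = fieldDelim; }
    public StreamFormat() { this("\t"); } // Hadoop streaming's default

    // One record for the executable's stdin: fields joined by the
    // delimiter, terminated by a newline.
    public String serialize(List<String> tuple) {
        return String.join(fieldDelim, tuple) + "\n";
    }

    // One line of the executable's stdout parsed back into fields; the
    // -1 limit keeps trailing empty fields.
    public List<String> deserialize(String line) {
        return Arrays.asList(line.split(Pattern.quote(fieldDelim), -1));
    }

    public static void main(String[] args) {
        StreamFormat comma = new StreamFormat(",");
        System.out.print(comma.serialize(Arrays.asList("1", "alice"))); // prints 1,alice
    }
}
```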
[Pig Wiki] Trivial Update of "LoadStoreRedesignProposal" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=36&rev2=37 -- /** * Set statistics about the data being written. + * @throws IOException */ - void setStatistics(ResourceStatistics stats); + void setStatistics(ResourceStatistics stats, String location, Configuration conf) throws IOException; + + /** + * Set schema of the data being written + * @throws IOException + */ + void setSchema(ResourceSchema schema, String location, Configuration conf) throws IOException; } @@ -699, +706 @@ Nov 23 2009, Gates * Added section "Changes to Streaming" + Nov 23 2009, Dmitriy Ryaboy + * updated StoreMetadata to match changes made to LoadMetadata +
[Pig Wiki] Update of "GroupFunction" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "GroupFunction" page has been changed by AlanGates. http://wiki.apache.org/pig/GroupFunction?action=diff&rev1=2&rev2=3 -- <> + + '''AS OF PIG 0.2 GROUP FUNCTIONS HAVE BEEN REMOVED FROM THE LANGUAGE. THE FOLLOWING APPLIES ONLY TO PIG 0.1.''' + == Group Functions ==
[Pig Wiki] Update of "LoadStoreRedesignProposal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreRedesignProposal" page has been changed by AlanGates. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=38&rev2=39 -- !PigStorage.putNext(). While it will not be possible to use PigStorage directly, every effort should be made to share this code (most likely by putting the actual code in static utility methods that can be called by each class) to avoid double code maintenance costs. + It has been suggested that we should switch to the typed bytes protocol that is available in Hadoop and Hive (see + https://issues.apache.org/jira/browse/PIG-966?focusedCommentId=12781695&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781695 ). While we cannot switch the default, we can make this streaming + connection an interface so that users can easily extend it in the future. The interface should be quite simple: + + {{{ + interface PigToStream { + + /** + * Given a tuple, produce an array of bytes to be passed to the streaming + * executable. + */ + public byte[] serialize(Tuple t) throws IOException; + + /** + * Set the record delimiter to use when communicating with the streaming + * executable. The default if this is not set is \n. + */ + public void setRecordDelimiter(byte delimiter); + } + + interface StreamToPig { + + /** + * Given a byte array from a streaming executable, produce a tuple. + */ + public Tuple deserialize(byte[] bytes) throws IOException; + + /** + * Set the record delimiter to use when reading from the streaming + * executable. The default if this is not set is \n. + */ + public void setRecordDelimiter(byte delimiter); + } + }}} + + The default implementation of this would be as suggested above.
The syntax for describing how data is (de)serialized would then stay as it currently is, except instead of giving a + !StoreFunc the user would specify a !PigToStream, and instead of specifying a !LoadFunc a !StreamToPig. + + Additionally, it has been noted that this change takes away the current optimization of Pig Latin scripts such as the following: + + {{{ + A = load 'myfile' split by 'file'; + B = stream A through 'mycmd'; + store B into 'outfile'; + }}} + + In this case Pig will optimize the query by removing the load function and replacing it with !BinaryStorage, a function which simply passes the data as is to the streaming + executable. It does no record or field parsing. Similarly, the store in the above script would be replaced with !BinaryStorage. + + We have two options to replace this. First, we could say that if a class implementing !PigToStream also implements !InputFormat, then Pig will drop the Load statement and use that + !InputFormat directly to load data and then pass the results to the stream. The same would be done with !StreamToPig, !OutputFormat and store. Second, we could create + !IdentityLoader and !IdentityStreamToPig functions. !IdentityLoader.getNext would return a tuple that just had one bytearray, which would be the entire record. This would then be a + trivial serialization via the default !PigToStream. Similarly !IdentityStreamToPig would take the bytes returned by the stream and put them in a tuple of a single bytearray. The + store function would then naturally translate this tuple into the underlying bytes. + Functionally these are basically equivalent, since Pig would need to write code similar to the !IdentityLoader etc. for the second case. So I believe the primary difference is in + how it is presented to the user, not the functionality or code written underneath. + + Both of these approaches suffer from the problem that they assume !TextInputFormat and !TextOutputFormat.
For any other IF/OF it will not be clear how to parse key, value + pairs out of the stream data. + + This optimization represents a fair amount of work. As the current optimization is not documented, it is not clear how many users are using it. Based on that I vote that we + do not implement this optimization until such time as we see a need for it. === Remaining Tasks === * !BinStorage needs to implement !LoadMetadata's getSchema() to replace current determineSchema() @@ -709, +771 @@ Nov 23 2009, Dmitriy Ryaboy * updated StoreMetadata to match changes made to LoadMetadata + Nov 25 2009, Gates + * Updated section on streaming to suggest creating an interface for streaming (de)serializers rather than having only one hardwired option. Also added some thoughts on possible replacements for the current !BinaryStorage/split by file optimization. +
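A sketch of what a default implementation of the proposed pair might look like. String[] stands in for Pig's Tuple so the example is self-contained; only the byte-level contract (tab-separated fields, settable record delimiter) is illustrated, and the class name is invented.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch of a default implementation of the proposed PigToStream/StreamToPig
// pair. String[] stands in for Pig's Tuple to keep the example self-contained.
public class DefaultStreamSerde {
    private byte recordDelim = (byte) '\n'; // default per the proposal

    public void setRecordDelimiter(byte delimiter) { recordDelim = delimiter; }

    // PigToStream.serialize: fields joined by tab, record delimiter appended.
    public byte[] serialize(String[] fields) throws IOException {
        byte[] body = String.join("\t", fields).getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 1];
        System.arraycopy(body, 0, out, 0, body.length);
        out[body.length] = recordDelim;
        return out;
    }

    // StreamToPig.deserialize: strip the trailing delimiter, split on tab.
    public String[] deserialize(byte[] bytes) throws IOException {
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == recordDelim) len--;
        return new String(bytes, 0, len, StandardCharsets.UTF_8).split("\t", -1);
    }

    public static void main(String[] args) throws IOException {
        DefaultStreamSerde serde = new DefaultStreamSerde();
        String[] back = serde.deserialize(serde.serialize(new String[]{"1", "alice"}));
        System.out.println(back.length); // prints 2
    }
}
```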
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=132&rev2=133 -- ||6015||During execution, encountered a Hadoop error.|| ||6016||Out of memory.|| ||6017||Execution failed, while processing '|| + ||6018||Error while reading input|| == Change Log ==
[Pig Wiki] Update of "PigSkewedJoinSpec" by SriranjanManjunath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by SriranjanManjunath. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=13&rev2=14 -- = Skewed Join = <> + == Introduction == - - Parallel joins are vulnerable to the presence of skew in the underlying data. If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains [[#References|(1)]]. In order to counteract this problem, skewed join computes a histogram of the key space and uses this data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting one of the inputs on the join predicate and streaming the other input. + Parallel joins are vulnerable to the presence of skew in the underlying data. If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains [[#References|(1)]]. In order to counteract this problem, skewed join computes a histogram of the key space and uses this data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting one of the inputs on the join predicate and streaming the other input. <> - <> + == Use cases == - Skewed join can be used when the underlying data is sufficiently skewed and the user needs finer control over the allocation of reducers to counteract the skew. It should also be used when the data associated with a given key is too large to fit in memory. {{{ @@ -17, +16 @@ C = JOIN big BY b1, massive BY m1 USING "skewed"; }}} - In order to use skewed join, -* Skewed join currently works with two-table inner join. 
-* Append 'using "skewed"' construct to the join to force pig to use skewed join + * Append 'using "skewed"' construct to the join to force pig to use skewed join -* pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance. The default value is =0.5=. + * pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance. The default value is =0.5=. - <> + == Requirements == - -* Support a 'skewed' condition for the join command - Modify Join operator to have a "skewed" option. + * Support a 'skewed' condition for the join command - Modify Join operator to have a "skewed" option. -* Handle considerably large skew in the input data efficiently + * Handle considerably large skew in the input data efficiently -* Join tables whose keys are too big to fit in memory + * Join tables whose keys are too big to fit in memory + <> + == Implementation == - Skewed join translates into two map/reduce jobs - Sample and Join. The first job samples the input records and computes a histogram of the underlying key space. 
The second map/reduce job partitions the input table and performs a join on the predicate. In order to join the two tables, one of the tables is partitioned and the other is streamed to the reducer. The map task of the join job uses the ~-pig.keydist-~ file to determine the number of reducers per key. It then sends the key to each of the reducers in a round-robin fashion. Skewed joins happen in the reduce phase of the join job. {{attachment:partition.jpg}} <> + === Sampler phase === - If the underlying data is sufficiently skewed, load imbalances will result in a few reducers getting a lot of keys. As a first task, the sampler creates a histogram of the key distribution and stores it in the ~-pig.keydist-~ file. In order to reduce spillage, the sampler conservatively estimates the number of rows that can be sent to a single reducer based on the memory available for the reducer. The memory available for the reducer is a product of the heap size and the memusage parameter specified by the user. Using
[Pig Wiki] Update of "PigSkewedJoinSpec" by SriranjanManjunath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=14&rev2=15 -- '''!NullablePartitionWritable''' - This is an adapter class which provides a partition index to the NullableWritable class. The partition index is used by both the partitioning and the streaming table. For non-skewed keys, this value is set to -1. + This is an adapter class which provides a partition index to the !NullableWritable class. The partition index is used by both the partitioning and the streaming table. For non-skewed keys, this value is set to -1. '''!PigMapReduce'''
[Pig Wiki] Update of "PigSkewedJoinSpec" by yinghe
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigSkewedJoinSpec" page has been changed by yinghe. http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=15&rev2=16 -- Number of Tuples from First Table (tupleCount) = (sampleCount / totalSampleCount) * (inputFileSize / avgDiskUsage) Number of Reducers = (int) Math.round(Math.ceil((double) tupleCount / tupleMCount)); }}} + + For example, if we assume + * total number of samples = 200 + * total number of samples with key k1 = 30 + * size of input file = 1G. + * totalMemory = 150M + * avgMemUsage for tuples of k1 = 150 bytes + * avgDiskUsage for tuples of k1 = 100 bytes + + then, + * estimated total number of k1 that can fit in memory = 150M/150 = 1M + * estimated total number of tuples from input file = 1G/100 = 10M tuples + * estimated number of tuples for k1 from input file = (30/200) * 10M = 1.5M + * estimated total number of reducers for k1 = Math.ceil (1.5M/1M) = 2 + + This calculation is done on every key of samples. If a key requires more than 1 reducer, it is regarded as a skewed key, and pre-allocated with multiple reducers. The reducers are allocated to skewed keys in round robin fashion. + This UDF generates an output which will be used by the following join job. The format of the output file is a map. It has two keys: * totalreducers: the number of total reducers for second job
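The arithmetic in the worked example above can be checked with a few lines of Java. Variable names follow the formulas in the text; this is just the calculation, not Pig's sampler code, and it assumes 1G = 2^30 bytes and 150M = 150 * 2^20 bytes as in the example.

```java
// The reducer-count estimate from the worked example, as plain arithmetic.
public class SkewEstimate {
    public static int reducersForKey(long sampleCount, long totalSampleCount,
                                     long inputFileSize, double avgDiskUsage,
                                     long totalMemory, double avgMemUsage) {
        // tuples of this key that fit in one reducer's memory (tupleMCount)
        double tupleMCount = totalMemory / avgMemUsage;
        // estimated tuples of this key in the whole input (tupleCount)
        double tupleCount = ((double) sampleCount / totalSampleCount)
                * (inputFileSize / avgDiskUsage);
        return (int) Math.ceil(tupleCount / tupleMCount);
    }

    public static void main(String[] args) {
        long G = 1024L * 1024 * 1024, M = 1024L * 1024;
        // 30 of 200 samples are k1, 1G input, 100 bytes/tuple on disk,
        // 150M of heap, 150 bytes/tuple in memory: 1.5M tuples vs 1M in
        // memory, so k1 is skewed and gets 2 reducers.
        System.out.println(reducersForKey(30, 200, G, 100.0, 150 * M, 150.0)); // prints 2
    }
}
```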
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=133&rev2=134 -- ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| ||1109||Input ( ) on which outer join is desired should have a valid schema|| + ||1110||"Unsupported query: You have an partition column () inside a in the filter condition.|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=134&rev2=135 -- ||1107||Try to merge incompatible types (eg. numerical type vs non-numeircal type)|| ||1108||Duplicated schema|| ||1109||Input ( ) on which outer join is desired should have a valid schema|| - ||1110||"Unsupported query: You have an partition column () inside a in the filter condition.|| + ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| + ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=135&rev2=136 -- ||2206||Error visiting POSort inner plan|| ||2207||POForEach inner plan has more than 1 root|| ||2208||Exception visiting foreach inner plan|| + ||2209||Internal error while processing any partition filter conditions in the filter after the load|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges -- New page: = Backward incompatible changes in Pig 0.7.0 = Pig 0.7.0 will include some major changes to Pig most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of this changes will not be backward compatible and will require users to change the pig scripts or their UDFs. This document is intended to keep track of this changes to that we can document them for the release. == Changes to the Load and Store functions == == Handling Compressed Data == == Local Mode == == Streaming == == Other Changes == - Split by file == Open Questions ==
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=136&rev2=137 -- ||1109||Input ( ) on which outer join is desired should have a valid schema|| ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| + ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=1&rev2=2 -- = Backward incompatible changes in Pig 0.7.0 = - Pig 0.7.0 will include some major changes to Pig most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of this changes will not be backward compatible and will require users to change the pig scripts or their UDFs. This document is intended to keep track of this changes to that we can document them for the release. + Pig 0.7.0 will include some major changes to Pig, most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of these changes will not be backward compatible and will require users to change their pig scripts or their UDFs. This document is intended to keep track of such changes so that we can document them for the release. - == Changes to the Load and Store functions == + == Changes to the Load and Store Functions == == Handling Compressed Data == + + In 0.6.0 or earlier versions Pig supported bzip compressed files with extensions of .bz or .bz2 as well as gzip compressed files with .gz extension. Pig was able to both read and write files in this format with the understanding that gzip compressed files could not be split across multiple maps while bzip compressed files could. Also, data compression was completely decoupled from the data format and Load/Store functions, meaning that any loader could read compressed data and any store function could write it just by virtue of having the right extension on the files it was reading or writing. + + With Pig 0.7.0 the read/write functionality is taken over by Hadoop's Input/OutputFormat, and how compression is handled, or whether it is handled at all, depends on the Input/OutputFormat used by the loader/store function. 
+ + The main input format that supports compression is TextInputFormat. It supports bzip files with .bz2 extension and gzip files with .gz extension. '''Note that it does not support .bz files'''. PigStorage is the only loader that comes with Pig that is derived from TextInputFormat, which means it will be able to handle .bz2 and .gz files. Other loaders such as BinStorage will no longer support compression. + + On the store side, TextOutputFormat also supports compression but the store function needs to do additional work to enable it. Again, PigStorage will support compression while other functions will not. + + If you have a custom load/store function that needs to support compression, you would need to make sure that the underlying Input/OutputFormat supports this type of compression. + == Local Mode == == Streaming == == Other Changes ==
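The support matrix described above boils down to a file-extension check. The helper below is hypothetical (it is not a Pig API); it merely encodes the stated rules for a TextInputFormat-based loader such as PigStorage.

```java
// Hypothetical helper encoding the 0.7.0 extension rules stated above
// for a TextInputFormat-based loader such as PigStorage.
public class CompressionSupport {
    public static boolean hasCompressedExtension(String path) {
        return path.endsWith(".gz") || path.endsWith(".bz2") || path.endsWith(".bz");
    }

    /** True if a TextInputFormat-based loader can read the file in 0.7.0. */
    public static boolean readableByTextInputFormat(String path) {
        if (path.endsWith(".bz")) return false;   // .bz is no longer supported
        return path.endsWith(".bz2")              // bzip2: supported, splittable
            || path.endsWith(".gz")               // gzip: supported, not splittable
            || !hasCompressedExtension(path);     // plain, uncompressed text
    }

    public static void main(String[] args) {
        System.out.println(readableByTextInputFormat("part-00000.bz2")); // prints true
        System.out.println(readableByTextInputFormat("part-00000.bz"));  // prints false
    }
}
```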
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=2&rev2=3 -- == Local Mode == == Streaming == - == Other Changes == + == Split by File == - - Split by file + In the earlier versions of Pig, a user could specify "split by file" on the loader statement which would make sure that each map got the entire file rather than having the files further divided into blocks. This feature was primarily designed for streaming optimization but could also be used with loaders that can't deal with incomplete records. We don't believe that this functionality has been widely used. + + Because the slicing of the data is no longer in Pig's control, we can't support this feature generically for every loader. If a particular loader needs this functionality, it will need to make sure that the underlying InputFormat supports it. + + We will have a different approach for streaming optimization if that functionality is necessary. == Open Questions ==
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=137&rev2=138 -- ||2207||POForEach inner plan has more than 1 root|| ||2208||Exception visiting foreach inner plan|| ||2209||Internal error while processing any partition filter conditions in the filter after the load|| + ||2210||Internal Error in logical optimizer.|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=138&rev2=139 -- ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| + ||1113||Please provide uri to the metadata server using -Dmetadata.uri system property|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=139&rev2=140 -- ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| - ||1113||Please provide uri to the metadata server using -Dmetadata.uri system property|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "Pig070IncompatibleChanges" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=3&rev2=4 -- Pig 0.7.0 will include some major changes to Pig, most of them driven by the [[LoadStoreRedesignProposal | Load-Store redesign]]. Some of these changes will not be backward compatible and will require users to change their pig scripts or their UDFs. This document is intended to keep track of such changes so that we can document them for the release. == Changes to the Load and Store Functions == + + TBW + + == Handling Compressed Data == In 0.6.0 or earlier versions Pig supported bzip compressed files with extensions of .bz or .bz2 as well as gzip compressed files with .gz extension. Pig was able to both read and write files in this format with the understanding that gzip compressed files could not be split across multiple maps while bzip compressed files could. Also, data compression was completely decoupled from the data format and Load/Store functions, meaning that any loader could read compressed data and any store function could write it just by virtue of having the right extension on the files it was reading or writing. @@ -19, +22 @@ == Local Mode == == Streaming == + + There are two things that are changing in streaming. + + First, in the initial (0.7.0) release, '''we will not support the optimization''' where, if streaming follows a load of a compatible format or is followed by a format-compatible store, the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlikely to be used. 
+ + Second, '''you can no longer use load/store functions for (de)serialization.''' + == Split by File == In the earlier versions of Pig, a user could specify "split by file" on the loader statement which would make sure that each map got the entire file rather than having the files further divided into blocks. This feature was primarily designed for streaming optimization but could also be used with loaders that can't deal with incomplete records. We don't believe that this functionality has been widely used.
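The extension-based compression behavior described above (0.6.0 and earlier) can be sketched in Pig Latin; the paths here are illustrative, not from the original page:

{{{
-- 0.6.0 and earlier: compression chosen purely by file extension,
-- independent of the load/store function in use
A = load 'input/part-00000.bz2' using PigStorage('\t');
-- any store function could emit gzip output simply by targeting a .gz path
store A into 'output.gz' using PigStorage('\t');
}}}

Note the asymmetry the page describes: the bzip input above could be split across maps, while gzip files could not.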
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=4&rev2=5 -- == Changes to the Load and Store Functions == - TBW + TBW [Need to take a load (with and withoutcustom slicer) and a store function and create new versions as examples. Can use PigStorage for (1) and (3) but need some loader for (2).] + == Handling Compressed Data == @@ -21, +22 @@ If you have a custom load/store function that needs to support compression, you would need to make sure that the underlying Input/OutputFormat supports this type of compression. == Local Mode == + + The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent for most applications. Possible differnces you will see: + + 1. Hadoop local mode is about order of magnitude slower than Pig's local mode. Something that Hadoop team promised to address. + 2. For algebraic functions, no the entire Algebraic interface will be used which is likely a good think if you are using local mode for testing your production applications. + == Streaming == There are two things that are changing in streaming. First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlekly to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' + Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. 
The defaul (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. == Split by File ==
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=5&rev2=6 -- == Changes to the Load and Store Functions == - TBW [Need to take a load (with and withoutcustom slicer) and a store function and create new versions as examples. Can use PigStorage for (1) and (3) but need some loader for (2).] + TBW [Need to take a load (with and without custom slicer) and a store function and create new versions as examples. Can use PigStorage for (1) and (3) but need to choose a loader for (2).] == Handling Compressed Data == @@ -32, +32 @@ There are two things that are changing in streaming. - First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlekly to be used. + First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and that the optimization was never documented and so unlikely to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The defaul (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. 
+ Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. == Split by File ==
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=6&rev2=7 -- == Local Mode == - The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent for most applications. Possible differences you will see: + The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent for most applications. Possible differences you will see are: 1. Hadoop local mode is about an order of magnitude slower than Pig's local mode. Something the Hadoop team promised to address. - 2. For algebraic functions, no the entire Algebraic interface will be used which is likely a good think if you are using local mode for testing your production applications. + 2. For algebraic functions, now the entire Algebraic interface will be used, which is likely a good thing if you are using local mode for testing your production applications. == Streaming == There are two things that are changing in streaming. - First, in the initial (0.7.0) release, '''we will not support for optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store.
The main reason we are not porting the optimization is that the work is not trivial and the optimization was never documented and so unlikely to be used. Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. @@ -46, +46 @@ == Open Questions == + Q: Should String->Text conversion be part of this release. + A: Pros: 20-30% improved memory utilization; cons: more compatibility is broken. +
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=7&rev2=8 -- We will have a different approach for streaming optimization if that functionality is necessary. + == Access to Local Files from Map-Reduce Mode + + In the earlier version of Pig, you could access a local file from map-reduce mode by prepending file:// to the file location: + + {{{ + A = load 'file:/mydir/myfile'; + ... + }}} + + When Pig processed this statement, it would first copy the data to DFS and then import it into the execution pipeline. + + In Pig 0.7.0, you can no longer do this and if this functionality is still desired, you can add the copy into your script manually: + + {{{ + fs copyFromLocal src dist + A = load 'dist'; + + }}} + == Open Questions == Q: Should String->Text conversion be part of this release.
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=8&rev2=9 -- First, in the initial (0.7.0) release, '''we will not support optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and the optimization was never documented and so unlikely to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. + Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This formar is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can be also used directly in the streaming statement. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. + + We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within straming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats. == Split by File == @@ -44, +46 @@ We will have a different approach for streaming optimization if that functionality is necessary. 
- == Access to Local Files from Map-Reduce Mode + == Access to Local Files from Map-Reduce Mode == In the earlier version of Pig, you could access a local file from map-reduce mode by prepending file:// to the file location:
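The streaming change above means the (de)serializer is now named explicitly in the script rather than being a load/store function. A rough Pig Latin sketch of what that might look like (the command name and delimiter are hypothetical; the authoritative syntax is in the LoadStoreRedesignProposal page cited above):

{{{
-- hypothetical example: naming the new PigStreaming (de)serializer directly
DEFINE mycmd `perl mycmd.pl`
    input(stdin using PigStreaming(','))
    output(stdout using PigStreaming(','));
A = load 'data' as (f1:chararray, f2:int);
B = stream A through mycmd;
}}}

Omitting the input/output clauses would fall back to the default PigStreaming behavior, which per the page continues to match the old PigStorage-style format.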
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Pra deepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=9&rev2=10 -- Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This formar is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can be also used directly in the streaming statement. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. - We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within straming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats. + We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within streaming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats. == Split by File == @@ -60, +60 @@ In Pig 0.7.0, you can no longer do this and if this functionality is still desired, you can add the copy into your script manually: {{{ - fs copyFromLocal src dist + fs -copyFromLocal src dist A = load 'dist'; }}}
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=10&rev2=11 -- }}} + == Removing Custom Comparators + + This functionality was added to deal with a gap in Pig's early functionality - the lack of numeric comparison in order by as well as the lack of descending sort. The replacement functionality has been present for the last 4 releases and custom comparators have been deprecated in the last several releases. Custom comparator support is removed in this release. + == Open Questions == Q: Should String->Text conversion be part of this release.
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Olg aN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by OlgaN. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=11&rev2=12 -- }}} - == Removing Custom Comparators + == Removing Custom Comparators == This functionality was added to deal with a gap in Pig's early functionality - the lack of numeric comparison in order by as well as the lack of descending sort. The replacement functionality has been present for the last 4 releases and custom comparators have been deprecated in the last several releases. Custom comparator support is removed in this release.
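With numeric comparison and descending sorts built into order by, the common uses of custom comparators can now be expressed directly in Pig Latin (relation and field names here are illustrative):

{{{
A = load 'data' as (name:chararray, score:int);
-- descending numeric sort, previously a motivation for a custom comparator
B = order A by score desc;
}}}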
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=140&rev2=141 -- ||2182||Prune column optimization: Only relational operator can be used in column prune optimization.|| ||2183||Prune column optimization: LOLoad must be the root logical operator.|| ||2184||Prune column optimization: Fields list inside RequiredFields is null.|| + ||2185||Prune column optimization: Unable to prune columns.|| ||2186||Prune column optimization: Cannot locate node from successor|| ||2187||Column pruner: Cannot get predessors|| ||2188||Column pruner: Cannot prune columns|| @@ -422, +423 @@ ||2208||Exception visiting foreach inner plan|| ||2209||Internal error while processing any partition filter conditions in the filter after the load|| ||2210||Internal Error in logical optimizer.|| + ||2211||Column pruner: Unable to prune columns.|| + ||2212||Unable to prune plan.|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigMix" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by AlanGates. http://wiki.apache.org/pig/PigMix?action=diff&rev1=13&rev2=14 -- || PigMix_12 || 55.33|| 95.33 || 0.58 || || Total || 1352.33 || 1357 || 1.00 || || Weighted avg || || || 1.04 || + + Run date: January 4, 2010, run against 0.6 branch as of that day + || Test || Pig run time || Java run time || Multiplier || + || PigMix_1 || 138.33 || 112.67|| 1.23 || + || PigMix_2 || 66.33|| 39.33 || 1.69 || + || PigMix_3 || 199 || 83.33 || 2.39 || + || PigMix_4 || 59 || 60.67 || 0.97 || + || PigMix_5 || 80.33|| 113.67|| 0.71 || + || PigMix_6 || 65 || 77.67 || 0.84 || + || PigMix_7 || 63.33|| 61|| 1.04 || + || PigMix_8 || 40 || 47.67 || 0.84 || + || PigMix_9 || 214 || 215.67|| 0.99 || + || PigMix_10 || 284.67 || 284.33|| 1.00 || + || PigMix_11 || 141.33 || 151.33|| 0.93 || + || PigMix_12 || 55.67|| 115 || 0.48 || + || Total || 1407 || 1362.33 || 1.03 || + || Weighted Avg || || || 1.09 || + == Features Tested ==
[Pig Wiki] Update of "PigLogicalPlanOptimizerRewrite" b y AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigLogicalPlanOptimizerRewrite" page has been changed by AlanGates. http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite -- New page: == Problem Statement == The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. === Issues that Need to be Addressed in this Rework === '''One:''' !OperatorPlan has far too many operations. It has 29 public methods. This needs to be pared down to a minimal set of operators that are well defined. '''Two:''' Currently, relational operators (Join, Sort, etc.) and expression operators (add, equals, etc.) are both !LogicalOperators. Operators such as Cogroup that contain expressions have !OperatorPlans that contain these expressions. This was done for two reasons: 1. To make it easier for visitors to visit both types of operators (that is, visitors didn't have to have separate logic to handle expressions). 1. To better handle the ambiguous nature of inner plans in Foreach. However, it has led to visitors and graphs that are hard to understand. Both of the above concerns can be handled while breaking this binding so that relational and expression operators are separate types.
'''Three:''' Related to the issue of relational and expression operators sharing a type is that inner plans have connections to outer plans. Take for example a script like {{{ A = load 'file1' as (x, y); B = load 'file2' as (u, v); C = cogroup A by x, B by u; D = filter C by A.x > 0; }}} In this case the cogroup will have two inner plans, one of which will be a project of A.x and the other a project of B.u. The !LOProject objects representing these projections will hold actual references to the !LOLoad operators for A and B. This makes disconnecting and rearranging nodes in the plan much more difficult. Consider if the optimizer wants to move the filter in D above C. Now it has to not only change connections in the outer plan between load, cogroup, and filter; it also has to change connections in the first inner plan of C, because this now needs to point to the !LOFilter for D rather than the !LOLoad for A. '''Four:''' The work done on Operator and !OperatorPlan to support the original rules for the optimizer had two main problems: 1. The set of primitives chosen were not the correct ones. 1. The operations chosen were put on the generic super classes (Operator) rather than further down on the specific classes that would know how to implement them. '''Five:''' At a number of points efforts were made to keep the logical plan close to the physical plan. For example, !LOProject represents all of the same operations that !POProject does. While this is convenient in translation, it is not convenient when trying to optimize the plan. The !LogicalPlan needs to focus on representing the logic of the script in a way that is easy for semantic checkers (such as !TypeChecker) and the optimizer to work with. '''Six:''' The rule of one operation per operator was violated. !LOProject handles three separate roles (converting from a relational to an expression operator, actually projecting, and converting from an expression to a relational operator).
This makes coding much more complex for the optimizer because when it encounters an !LOProject it must first determine which of these three roles it is playing before it can understand how to work with it. The following proposal will address all of these issues. == Proposed Methodology == Fixing these issues will require extensive changes, including a complete rewrite of Operator, !OperatorPlan, !PlanVisitor, !LogicalOperator, !LogicalPlan, !LogicalPlanVisitor, every current subclass of !LogicalOperator, and all existing optimizer rules. It will also require extensive changes, though not complete rewrites, in existing subclasses of !LogicalTransformer. To avoid destabilizing the entire codebase during this operation, this will be done in a new set of packages as a totally separate set of classes. Linkage code will be written to translate the current !LogicalPlan to the new experimental !LogicalPlan class. A new
New attachment added to page PigLogicalPlanOptimizerRewrite on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigLogicalPlanOptimizerRewrite" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: expressiontree.jpg Attachment size: 28430 Attachment link: http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite?action=AttachFile&do=get&target=expressiontree.jpg Page link: http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite
[Pig Wiki] Update of "Pig070IncompatibleChanges" by Pra deepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Pig070IncompatibleChanges" page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=12&rev2=13 -- First, in the initial (0.7.0) release, '''we will not support optimization''' where if streaming follows load of compatible format or is followed by format compatible store the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting the optimization is that the work is not trivial and the optimization was never documented and so unlikely to be used. - Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that needed to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This formar is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can be also used directly in the streaming statement. Details of the new interface are describe in http://wiki.apache.org/pig/LoadStoreRedesignProposal. + Second, '''you can no longer use load/store functions for (de)serialization.''' A new interface has been defined that has to be implemented for custom (de)serializations. The default (PigStorage) format will continue to work. This format is now implemented by a class called org.apache.pig.impl.streaming.PigStreaming that can also be used directly in the streaming statement. Details of the new interface are described in http://wiki.apache.org/pig/LoadStoreRedesignProposal. We have also removed org.apache.pig.builtin.BinaryStorage loader/store function and org.apache.pig.builtin.PigDump which were only used from within streaming. They can be restored if needed - we would just need to implement the corresponding Input/OutputFormats.
[Pig Wiki] Update of "PigJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal -- New page: = Pig Journal = This document is a successor to the ProposedRoadMap. Rather than simply propose the work going forward for Pig, it also summarizes work done in the past (back to Pig moving from a research project at Yahoo Labs to being a part of the Yahoo grid team, which was approximately the time Pig was first released to open source), current work, and proposed future work. Note that proposed future work is exactly that, __proposed__. There is no guarantee that it will be done, and the project is still open to input on whether and when such work should be done. == Completed Work == The following table contains a list of features that have been completed, as of Pig 0.6 || Feature || Available in Release || Comments || || Describe Schema || 0.1 || || || Explain Plan || 0.1 || || || Add log4j to Pig Latin || 0.1 || || || Parameterized Queries|| 0.1 || || || Streaming|| 0.1 || || || Documentation|| 0.2 || Docs are never really done of course, but Pig now has a setup document, tutorial, Pig Latin users and reference guides, a cookbook, a UDF writers guide, and API javadocs. || || Early error detection and failure|| 0.2 || When this was originally added to the !ProposedRoadMap it referred to being able to do type checking and other basic semantic checks. 
|| || Remove automatic string encoding || 0.2 || || || Add ORDER BY DESC|| 0.2 || || || Add LIMIT|| 0.2 || || || Add support for NULL values || 0.2 || || || Types beyond String || 0.2 || || || Multiquery support || 0.3 || || || Add skewed join || 0.4 || || || Add merge join || 0.4 || || || Support Hadoop 0.20 || 0.5 || || || Improved Sampling|| 0.6 || There is still room for improvement for order by sampling || || Change bags to spill after reaching fixed size || 0.6 || Also created bag backed by Hadoop iterator for single UDF cases || || Add Accumulator interface for UDFs || 0.6 || || || Switch local mode to Hadoop local mode || 0.6 || || || Outer join for default, fragment-replicate, skewed || 0.6 || || || Make configuration available to UDFs || 0.6 || || == Work in Progress == This covers work that is currently being done. For each entry the main JIRA for the work is referenced. || Feature || JIRA || Comments || || Metadata || [[http://issues.apache.org/jira/browse/PIG-823|PIG-823]] || || || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || || Load Store Redesign || [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]] || || || Add SQL Support || [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]] || || || Change Pig internal representation of chararray to Text || [[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready, unclear when to commit to minimize disruption to users and destabilization to code base. || || Integration with Zebra || [[http://issues.apache.org/jira/browse/PIG-833|PIG-833]] || || == Proposed Future Work == Work that the Pig project proposes to do in the future is further broken into three categories: 1. Work that we are agreed needs to be done, and also the approach to the work is generally agreed upon, but we have not gotten to it yet 2. Work that we are agreed needs to be done, but the approach is not yet clear or there is not general agreement as to which approach is best 3.
Experimental, which includes features that
[Pig Wiki] Update of "PigTools" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTools" page has been changed by AlanGates. http://wiki.apache.org/pig/PigTools?action=diff&rev1=13&rev2=14 -- 'hamake' utility allows you to automate incremental processing of datasets stored on HDFS using Hadoop tasks written in Java or using PigLatin scripts. + === Piglet === + http://github.com/iconara/piglet + + Piglet is a DSL for writing Pig Latin scripts in Ruby. Piglet aims to look like Pig Latin while allowing for things like loops and control of flow that are missing from Pig. + + === PigPen === http://issues.apache.org/jira/browse/PIG-366