[Pig Wiki] Update of Pig070LoadStoreHowTo by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=2&rev2=3
--
  return only required fields, it should implement LoadPushDown to improve query performance.
  * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.
- The !LoadFunc abstract class
+ The !LoadFunc abstract class is the main class to extend to implement a loader. The methods which need to be overridden are explained below:
+  * getInputFormat(): This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the !InputFormat is a hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce.
+  * setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. This method is called multiple times by pig - implementations should bear in mind that this method is called multiple times and should ensure there are no inconsistent side effects due to the multiple calls.
+  * prepareToRead(): Through this method the !RecordReader associated with the !InputFormat provided by the !LoadFunc is passed to the !LoadFunc. The !RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to pig.
+  * getNext(): The meaning of getNext() has not changed and is called by Pig runtime to get the next tuple in the data - in the new API, this is the method wherein the implementation will use the underlying !RecordReader and construct a tuple
+
+ The following methods have default implementations in !LoadFunc and should be overridden only if needed:
+  * setUdfContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the !UDFContext any information which the Loader needs to store between various method invocations in the front end and back end. A use case is to store !RequiredFieldList passed to it in !LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before returning tuples in getNext(). The default implementation in !LoadFunc has an empty body. This method will be called before other methods.
+  * relativeToAbsolutePath(): Pig runtime will call this method to allow the Loader to convert a relative load location to an absolute location. The default implementation provided in !LoadFunc handles this for !FileSystem locations. If the load source is something else, the loader implementation may choose to override this.
+
+ === Example Implementation ===
+ The loader implementation in the example is a loader for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to the current !PigStorage loader in Pig. The new implementation uses an existing Hadoop supported !InputFormat - !TextInputFormat - as the underlying !InputFormat.
+
+ {{{
+ public class SimpleTextLoader extends LoadFunc {
+     protected RecordReader in = null;
+     private byte fieldDel = '\t';
+     private ArrayList<Object> mProtoTuple = null;
+     private TupleFactory mTupleFactory = TupleFactory.getInstance();
+     private static final int BUFFER_SIZE = 1024;
+
+     public SimpleTextLoader() {
+     }
+
+     /**
+      * Constructs a Pig loader that uses specified character as a field delimiter.
+      *
+      * @param delimiter
+      *            the single byte character that is used to separate fields.
+      *            (\t is the default.)
+      */
+     public SimpleTextLoader(String delimiter) {
+         this();
+         if (delimiter.length() == 1) {
+             this.fieldDel = (byte)delimiter.charAt(0);
+         } else if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
+             switch (delimiter.charAt(1)) {
+             case 't':
+                 this.fieldDel = (byte)'\t';
+                 break;
+
+             case 'x':
+                 fieldDel =
+                     Integer.valueOf(delimiter.substring(2), 16).byteValue();
+                 break;
+
+             case 'u':
+                 this.fieldDel =
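The constructor in the diff above decodes an escape-sequence delimiter spec into a single byte, but the diff is truncated mid-switch. The following is a minimal, Pig-independent sketch of just that decoding logic; the class and method names (`SimpleTextDelimiter`, `parse`) are illustrative and not part of the Pig API, and the `\u` case, which is cut off in the diff, is deliberately left out rather than guessed at:

```java
// Sketch of the delimiter-decoding logic from the SimpleTextLoader
// constructor shown in the diff; names are illustrative only.
public final class SimpleTextDelimiter {
    private SimpleTextDelimiter() {}

    /**
     * Decode a field-delimiter spec into a single byte: a lone character
     * is used as-is, "\\t" means tab, and "\\xNN" is a hex byte value.
     * (The "\\u" case is truncated in the diff and not reproduced here.)
     */
    public static byte parse(String delimiter) {
        if (delimiter.length() == 1) {
            return (byte) delimiter.charAt(0);
        }
        if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
            switch (delimiter.charAt(1)) {
            case 't':
                return (byte) '\t';
            case 'x':
                // hex escape such as "\\x7c" -> 0x7c ('|')
                return Integer.valueOf(delimiter.substring(2), 16).byteValue();
            }
        }
        throw new IllegalArgumentException("Unsupported delimiter: " + delimiter);
    }
}
```

For example, `SimpleTextDelimiter.parse("\\x7c")` yields the byte for `|`, matching the `case 'x'` branch in the wiki's example constructor.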
The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=3&rev2=4
--
  = Overview =
- This page describes how to go about writing Load functions and Store functions using the API available in Pig 0.7.0.
+ This page describes how to go about writing Load functions and Store functions using the API available in Pig 0.7.0.
  == How to implement a Loader ==
  [[LoadFunc || http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup]] abstract class has the main methods for loading data and for most use cases it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality:
-  * !LoadMetadata has methods to deal with metadata - most implementations of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for
+  * !LoadMetadata has methods to deal with metadata - most implementations of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. Implementations can return null return values for these methods if they are not applicable for that implementation.
-  * !LoadPushDown has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields
+  * !LoadPushDown has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently
-  return only required fields, it should implement LoadPushDown to improve query performance.
+  return only required fields, it should implement !LoadPushDown to improve query performance.
-  * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.
+  * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.
  The !LoadFunc abstract class is the main class to extend to implement a loader. The methods which need to be overridden are explained below:
-  * getInputFormat(): This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by pig in the same manner (and in the same context)
-  as by Hadoop in a map-reduce java program. If the !InputFormat is a hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be
+  * getInputFormat(): This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the !InputFormat is a hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be
  implemented using the new API in org.apache.hadoop.mapreduce.
-  * setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. This method is called multiple
+  * setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. This method is called multiple times by pig - implementations should bear in mind that this method is called multiple times and should ensure there are no inconsistent side effects due to the multiple calls.
-  * prepareToRead(): Through this method the !RecordReader
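Independent of the Pig API, the core work a getNext() implementation performs for delimited text — taking the line handed back by the !RecordReader and cutting it into fields on the delimiter byte before building a tuple — can be sketched as plain Java. The class and method names here (`LineSplitter`, `split`) are illustrative, not part of Pig or Hadoop:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the line-to-fields step a PigStorage-style
// getNext() performs for delimited text; not actual Pig API code.
public final class LineSplitter {
    private LineSplitter() {}

    /**
     * Split one text record into byte[] fields on the delimiter byte,
     * the way a delimited-text loader builds a tuple's fields.
     * A trailing delimiter yields a trailing empty field.
     */
    public static List<byte[]> split(byte[] line, byte fieldDel) {
        List<byte[]> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length; i++) {
            if (line[i] == fieldDel) {
                fields.add(Arrays.copyOfRange(line, start, i));
                start = i + 1;
            }
        }
        fields.add(Arrays.copyOfRange(line, start, line.length));
        return fields;
    }
}
```

In a real loader each `byte[]` field would be wrapped as a !DataByteArray and appended to a tuple via TupleFactory, as the SimpleTextLoader example does.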
The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=4&rev2=5
--
  == How to implement a Loader ==
  [[LoadFunc || http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup]] abstract class has the main methods for loading data and for most use cases it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality:
+  * !LoadMetadata has methods to deal with metadata - most implementations of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. Implementations can return null return values for these methods if they are not applicable for that implementation.
+  * !LoadPushDown has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance.
-  * !LoadMetadata has methods to deal with metadata - most implementations of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for
-  loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing
-  the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. Implementations can return null return values for these methods if they are
-  not applicable for that implementation.
-  * !LoadPushDown has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields
-  are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently
-  return only required fields, it should implement !LoadPushDown to improve query performance.
  * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.
  The !LoadFunc abstract class is the main class to extend to implement a loader. The methods which need to be overridden are explained below:
+  * getInputFormat(): This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the !InputFormat is a hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce.
+  * setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. This method is called multiple times by pig - implementations should bear in mind that this method is called multiple times and should ensure there are no inconsistent side effects due to the multiple calls.
-  * getInputFormat(): This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by pig in the same manner (and in the same context)
-  as by Hadoop in a map-reduce java program. If the !InputFormat is a hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be
-  implemented using the new API in
The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=5&rev2=6
--
+ #format wiki
+ #language en
+
+ <<Navigation(children)>>
+ <<TableOfContents>>
+
  = Overview =
  This page describes how to go about writing Load functions and Store functions using the API available in Pig 0.7.0.
The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=7&rev2=8
--
  = Overview =
  This page describes how to go about writing Load functions and Store functions using the API available in Pig 0.7.0.
- The main motivation for the changes in the Pig 0.7.0 load/store API is to move closer to using Hadoop's InputFormat and OutputFormat classes. This way pig users/developers can create new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the InputFormat and likewise on the writing end, the complexity of writing will lie in the OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop InputFormat and OutputFormat is available for them.
+ The main motivation for the changes in the Pig 0.7.0 load/store API is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the !InputFormat and likewise on the writing end, the complexity of writing will lie in the !OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat is available for them.
- '''A general note applicable to both LoadFunc and StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
+ '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/!OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
  = How to implement a Loader =
  [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup || LoadFunc]] abstract class has the main methods for loading data and for most use cases it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality:
@@ -147, +147 @@
  * storeMetadata: This interface has methods to interact with metadata systems to store schema and store statistics. This interface is truly optional and should only be implemented if metadata needs to be stored.
  The methods which need to be overridden in !StoreFunc are explained below:
-  * getOutputFormat(): This method will be called by Pig to get the !OutputFormat used by the storer. The methods in the OutputFormat (and underlying RecordWriter and OutputCommitter) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the OutputFormat is a hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched. So implementations should ensure that this method can be called multiple times without inconsistent side effects.
+  * getOutputFormat(): This method will be called by Pig to get the !OutputFormat used by the storer. The methods in the !OutputFormat (and underlying !RecordWriter and OutputCommitter) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the !OutputFormat is a hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the !OutputFormat will be called by pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched. So implementations should ensure that this method can be called multiple times without inconsistent side effects.
-  * setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. This method is called multiple times by pig -
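The mirror image on the store side — what a PigStorage-style putNext() does before handing a record to the !OutputFormat's !RecordWriter — is essentially joining the tuple's fields with the delimiter byte. A Pig-independent sketch of that serialization step follows; `FieldJoiner` and `join` are illustrative names, not the actual StoreFunc API:

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Illustrative sketch of delimited-text record serialization as done by
// a PigStorage-style storer before writing via the OutputFormat's
// RecordWriter; not actual Pig API code.
public final class FieldJoiner {
    private FieldJoiner() {}

    /** Join string fields with the delimiter byte into one output record. */
    public static byte[] join(List<String> fields, byte fieldDel) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                out.write(fieldDel); // delimiter between fields, not after
            }
            byte[] bytes = fields.get(i).getBytes();
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }
}
```

The line delimiter ('\n') would then be appended by the underlying TextOutputFormat-style writer, which is why the storer itself only emits field delimiters.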
The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=8&rev2=9
--
  The main motivation for the changes in the Pig 0.7.0 load/store API is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the !InputFormat and likewise on the writing end, the complexity of writing will lie in the !OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat is available for them.
- '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/!OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
+ '''A general note applicable to both LoadFunc and StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
  = How to implement a Loader =
- [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup || LoadFunc]] abstract class has the main methods for loading data and for most use cases it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality:
+ [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup | LoadFunc]] abstract class has the main methods for loading data and for most use cases it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality:
  * !LoadMetadata has methods to deal with metadata - most implementations of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. Implementations can return null return values for these methods if they are not applicable for that implementation.
  * !LoadPushDown has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance.
  * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.
@@ -26, +26 @@
  * getNext(): The meaning of getNext() has not changed and is called by Pig runtime to get the next tuple in the data - in the new API, this is the method wherein the implementation will use the underlying !RecordReader and construct a tuple
  The following methods have default implementations in !LoadFunc and should be overridden only if needed:
-  * setUdfContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the !UDFContext any information which the Loader needs to store between various method invocations in the front end and back end. A use case is to store !RequiredFieldList passed to it in !LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before returning tuples in getNext(). The default implementation in !LoadFunc has an empty body. This method will be called before other methods.
+  * setUdfContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the !UDFContext any information which the Loader needs to store between various method invocations in the front end and back
The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=9&rev2=10
--
  The main motivation for the changes in the Pig 0.7.0 load/store API is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the !InputFormat and likewise on the writing end, the complexity of writing will lie in the !OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat is available for them.
- '''A general note applicable to both LoadFunc and StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
+ '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
  = How to implement a Loader =
  [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup | LoadFunc]] abstract class has the main methods for loading data and for most use cases it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality:
@@ -26, +26 @@
  * getNext(): The meaning of getNext() has not changed and is called by Pig runtime to get the next tuple in the data - in the new API, this is the method wherein the implementation will use the underlying !RecordReader and construct a tuple
  The following methods have default implementations in !LoadFunc and should be overridden only if needed:
-  * setUdfContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the !UDFContext any information which the Loader needs to store between various method invocations in the front end and back end. A use case is to store !RequiredFieldList passed to it in !LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before returning tuples in getNext(). The default implementation in !LoadFunc has an empty body. This method will be called before other methods.
+  * setUdfContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the UDFContext any information which the Loader needs to store between various method invocations in the front end and back end. A use case is to store !RequiredFieldList passed to it in !LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before returning tuples in getNext(). The default implementation in !LoadFunc has an empty body. This method will be called before other methods.
  * relativeToAbsolutePath(): Pig runtime will call this method to allow the Loader to convert a relative load location to an absolute location. The default implementation provided in !LoadFunc handles this for !FileSystem locations. If the load source is something else, the loader implementation may choose to override this.
  == Example Implementation ==
@@ -157, +157 @@
  * relToAbsPathForStoreLocation(): Pig runtime will call this method to allow the Storer to convert a relative store location to an absolute location. An implementation is provided in !StoreFunc which handles this for FileSystem based locations.
  * checkSchema(): A Store function should implement this function to check that a given schema describing the data to be written is acceptable to it. The default implementation in !StoreFunc has an empty body. This method will be called before any calls to setStoreLocation().
- == Example Implementation ==
+ == Example Implementation ==
  The storer implementation in the example is a storer for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to the current !PigStorage storer in Pig. The new implementation uses an existing Hadoop supported !OutputFormat - TextOutputFormat - as the underlying !OutputFormat.
  {{{
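For !FileSystem locations, the relativeToAbsolutePath()/relToAbsPathForStoreLocation() behavior described above amounts to resolving a user-supplied path against the job's current working directory. The following is only a rough approximation of that idea using `java.net.URI`; the real !LoadFunc/!StoreFunc defaults handle more cases (schemes, globs, multiple comma-separated locations), and the class name `LocationResolver` is invented for illustration:

```java
import java.net.URI;

// Rough approximation of relative-to-absolute location resolution for
// FileSystem paths. The actual LoadFunc default implementation covers
// more cases; treat this as a sketch only.
public final class LocationResolver {
    private LocationResolver() {}

    /** Resolve a possibly-relative load/store location against curDir. */
    public static String toAbsolute(String location, String curDir) {
        URI loc = URI.create(location);
        if (loc.isAbsolute() || location.startsWith("/")) {
            return location; // already absolute: has a scheme or a leading /
        }
        // make sure the base ends with '/' so resolve() treats it as a directory
        String base = curDir.endsWith("/") ? curDir : curDir + "/";
        return URI.create(base).resolve(loc).toString();
    }
}
```

So a script's `LOAD 'foo.txt'` with working directory `/user/someuser` would, under this sketch, resolve to `/user/someuser/foo.txt`, while `hdfs://...`-style locations pass through untouched.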
Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diffrev1=10rev2=11 -- = How to implement a Loader = [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup | LoadFunc]] abstract class which has the main methods for loading data and for most use case it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality: - * !LoadMetadata has methods to deal with metadata - most implementation of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. Implementations can return null return values for these methods if they are not applicable for that implementation. + * [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup | LoadMetadata]] has methods to deal with metadata - most implementation of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. 
Implementations can return null return values for these methods if they are not applicable for that implementation. - * !LoadPushDown has methods to push operations from pig runtime into loader implementations - currently only projections .i.e the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data.If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance. + * [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup | LoadPushDown]] has methods to push operations from pig runtime into loader implementations - currently only projections .i.e the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data.If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance. - * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported. + * [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup | LoadCaster]] has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported. The !LoadFunc abstract class is the main class to extend to implement a loader. 
The methods which need to be overridden are explained below: * getInputFormat() :This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the !InputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. @@ -144, +144 @@ = How to implement a Storer = [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreFunc.java?view=markup | StoreFunc]] abstract class has the main methods for storing data and for most use cases it might suffice to extend it. There is an optional interface which can be implemented to achieve extended functionality: - * storeMetadata: This interface has methods to
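The loader walkthrough above (getInputFormat(), setLocation(), prepareToRead(), getNext()) ultimately boils down to: use the !RecordReader to fetch the next record, split it on the field delimiter, and wrap the fields in a tuple, as the PigStorage-style example later in the page does with '\t'. The field-splitting step can be sketched in plain Java, with no Pig or Hadoop dependencies; `TabRecordParser` and `parse` are made-up names for illustration, not part of the Pig API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the record-to-fields step a getNext()
// implementation performs; TabRecordParser is a hypothetical name.
public class TabRecordParser {
    private final char fieldDelim;

    public TabRecordParser(char fieldDelim) {
        this.fieldDelim = fieldDelim;
    }

    // Split one line (as handed back by the RecordReader) into the
    // field values that would populate the returned tuple.
    public List<String> parse(String line) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == fieldDelim) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // trailing field
        return fields;
    }

    public static void main(String[] args) {
        TabRecordParser p = new TabRecordParser('\t');
        System.out.println(p.parse("a\tb\tc")); // prints [a, b, c]
    }
}
```

In a real !LoadFunc, each element would typically be wrapped as a !DataByteArray (so the !LoadCaster can convert it lazily) rather than a String.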
[Pig Wiki] Update of Pig070LoadStoreHowTo by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The Pig070LoadStoreHowTo page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=11&rev2=12 -- = Overview = This page describes how to go about writing Load functions and Store functions using the API available in Pig 0.7.0. - The main motivation for the changes in Pig 0.7.0 load/store api is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the !InputFormat and likewise on the writing end, the complexity of writing will lie in the !OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat is available for them. + The main motivation for the changes in Pig 0.7.0 load/store api is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the !InputFormat and likewise on the writing end, the complexity of writing will lie in the !OutputFormat. This enables Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat is available for them. 
'''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.''' = How to implement a Loader = - [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup | LoadFunc]] abstract class which has the main methods for loading data and for most use case it might suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality: + [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup | LoadFunc]] abstract class has the main methods for loading data and for most use cases it would suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality: * [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup | LoadMetadata]] has methods to deal with metadata - most implementations of loaders don't need to implement this unless they interact with some metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata like partition keys and statistics. Implementations can return null return values for these methods if they are not applicable for that implementation. 
- * [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup | LoadPushDown]] has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance. + * [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup | LoadPushDown]] has methods to push operations from pig runtime into loader implementations - currently only projections, i.e. the pushProjection() method is called by Pig to communicate to the loader what exact fields are required in the pig script. The loader implementation can choose to honor the request or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance. (Irrespective of whether the implementation can or cannot return only the required fields, if the implementation also implements getSchema(), the schema returned in getSchema()
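The pushProjection() behavior described above - either honor the request by returning only the required columns, or decline and return every field - can be sketched without the Pig API. `FieldProjector` and the integer-index representation of the required fields are illustrative assumptions; the real interface passes a !RequiredFieldList:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative projection step a LoadPushDown-aware loader could apply
// before returning a tuple; FieldProjector is a hypothetical name.
public class FieldProjector {
    private final int[] requiredColumns; // null means "no pushdown: all fields"

    public FieldProjector(int[] requiredColumns) {
        this.requiredColumns = requiredColumns;
    }

    // Keep only the requested columns, in the requested order; with no
    // pushdown (or a declined request) the full record passes through.
    public List<String> project(List<String> record) {
        if (requiredColumns == null) {
            return record;
        }
        List<String> out = new ArrayList<String>();
        for (int col : requiredColumns) {
            out.add(record.get(col));
        }
        return out;
    }
}
```

In a real loader the required-field list would typically be stashed in the !UDFContext via setUdfContextSignature() on the front end and read back before getNext() runs on the back end.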
svn commit: r919628 - in /hadoop/pig/trunk/contrib/zebra: CHANGES.txt src/java/org/apache/hadoop/zebra/pig/SchemaConverter.java
Author: yanz Date: Fri Mar 5 21:32:24 2010 New Revision: 919628 URL: http://svn.apache.org/viewvc?rev=919628&view=rev Log: PIG-1276: Pig resource schema interface changed, so Zebra needs to catch exception thrown from the new interfaces. (xuefuz via yanz) Modified: hadoop/pig/trunk/contrib/zebra/CHANGES.txt hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/pig/SchemaConverter.java Modified: hadoop/pig/trunk/contrib/zebra/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/zebra/CHANGES.txt?rev=919628&r1=919627&r2=919628&view=diff == --- hadoop/pig/trunk/contrib/zebra/CHANGES.txt (original) +++ hadoop/pig/trunk/contrib/zebra/CHANGES.txt Fri Mar 5 21:32:24 2010 @@ -60,6 +60,8 @@ BUG FIXES +PIG-1276: Pig resource schema interface changed, so Zebra needs to catch exception thrown from the new interfaces. (xuefuz via yanz) + PIG-1256: Bag field should always contain a tuple type as the field schema in ResourceSchema object converted from Zebra Schema (xuefuz via yanz) PIG-1227: Throw exception if column group meta file is missing for an unsorted table (yanz) Modified: hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/pig/SchemaConverter.java URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/pig/SchemaConverter.java?rev=919628&r1=919627&r2=919628&view=diff == --- hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/pig/SchemaConverter.java (original) +++ hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/pig/SchemaConverter.java Fri Mar 5 21:32:24 2010 @@ -18,6 +18,8 @@ package org.apache.hadoop.zebra.pig; +import java.io.IOException; + import org.apache.hadoop.zebra.parser.ParseException; import org.apache.hadoop.zebra.schema.ColumnType; import org.apache.hadoop.zebra.schema.Schema.ColumnSchema; @@ -136,7 +138,8 @@ return schema; } -public static ResourceSchema convertToResourceSchema(org.apache.hadoop.zebra.schema.Schema tSchema) { +public static 
ResourceSchema convertToResourceSchema(org.apache.hadoop.zebra.schema.Schema tSchema) +throws IOException { if( tSchema == null ) return null; @@ -154,8 +157,8 @@ return rSchema; } -public static ResourceFieldSchema convertToResourceFieldSchema( -ColumnSchema cSchema) { +private static ResourceFieldSchema convertToResourceFieldSchema( +ColumnSchema cSchema) throws IOException { ResourceFieldSchema field = new ResourceFieldSchema(); if( cSchema.getType() == ColumnType.ANY && cSchema.getName().isEmpty() ) { // For anonymous column
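The hunk above declares the Zebra-to-ResourceSchema converters as `throws IOException` (the changed resource-schema interfaces throw checked exceptions) and special-cases anonymous columns, identified as type ANY with an empty name. That guard condition can be isolated in a self-contained sketch; `AnonymousColumnCheck` and `ColumnKind` are made-up stand-ins for the Zebra classes:

```java
// Toy stand-in for the Zebra ColumnSchema guard shown in the diff;
// AnonymousColumnCheck and ColumnKind are hypothetical names, not the
// real org.apache.hadoop.zebra.schema API.
public class AnonymousColumnCheck {
    public enum ColumnKind { ANY, INT, STRING }

    // Mirrors the condition in the diff: an anonymous column is one
    // whose type is ANY and whose name is empty.
    public static boolean isAnonymous(ColumnKind type, String name) {
        return type == ColumnKind.ANY && name.isEmpty();
    }
}
```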
svn commit: r919634 [3/3] - in /hadoop/pig/trunk: src/org/apache/pig/ src/org/apache/pig/backend/hadoop/executionengine/ src/org/apache/pig/experimental/logical/ src/org/apache/pig/experimental/logica
Modified: hadoop/pig/trunk/test/org/apache/pig/test/TestExperimentalLogToPhyTranslationVisitor.java URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/test/org/apache/pig/test/TestExperimentalLogToPhyTranslationVisitor.java?rev=919634&r1=919633&r2=919634&view=diff == --- hadoop/pig/trunk/test/org/apache/pig/test/TestExperimentalLogToPhyTranslationVisitor.java (original) +++ hadoop/pig/trunk/test/org/apache/pig/test/TestExperimentalLogToPhyTranslationVisitor.java Fri Mar 5 21:55:19 2010 @@ -17,14 +17,22 @@ */ package org.apache.pig.test; +import java.io.ByteArrayOutputStream; import java.io.IOException; +import java.io.PrintStream; import java.util.List; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator; +import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Add; +import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr; +import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Mod; +import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Multiply; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast; +import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PONegative; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject; +import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Subtract; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan; import 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter; import org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach; @@ -36,15 +44,29 @@ import org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.ConstantExpression; import org.apache.pig.data.DataType; import org.apache.pig.experimental.logical.LogicalPlanMigrationVistor; +import org.apache.pig.experimental.logical.expression.AddExpression; +import org.apache.pig.experimental.logical.expression.DivideExpression; +import org.apache.pig.experimental.logical.expression.IsNullExpression; import org.apache.pig.experimental.logical.expression.LogicalExpression; +import org.apache.pig.experimental.logical.expression.LogicalExpressionPlan; +import org.apache.pig.experimental.logical.expression.ModExpression; +import org.apache.pig.experimental.logical.expression.MultiplyExpression; +import org.apache.pig.experimental.logical.expression.NegativeExpression; +import org.apache.pig.experimental.logical.expression.NotExpression; +import org.apache.pig.experimental.logical.expression.ProjectExpression; +import org.apache.pig.experimental.logical.expression.SubtractExpression; +import org.apache.pig.experimental.logical.optimizer.PlanPrinter; import org.apache.pig.experimental.logical.optimizer.UidStamper; +import org.apache.pig.experimental.logical.relational.LOFilter; +import org.apache.pig.experimental.logical.relational.LOForEach; +import org.apache.pig.experimental.logical.relational.LOGenerate; +import org.apache.pig.experimental.logical.relational.LOLoad; import org.apache.pig.experimental.logical.relational.LogToPhyTranslationVisitor; import org.apache.pig.experimental.logical.relational.LogicalRelationalOperator; import org.apache.pig.experimental.logical.relational.LogicalSchema; import org.apache.pig.experimental.logical.relational.LogicalSchema.LogicalFieldSchema; -import org.apache.pig.experimental.plan.Operator; import 
org.apache.pig.experimental.plan.OperatorPlan; -import org.apache.pig.experimental.plan.PlanVisitor; +import org.apache.pig.impl.logicalLayer.LOIsNull; import org.apache.pig.impl.logicalLayer.LogicalPlan; import org.apache.pig.impl.plan.VisitorException; import org.apache.pig.test.utils.LogicalPlanTester; @@ -466,7 +488,6 @@ PhysicalPlan phyPlan = translatePlan(newLogicalPlan); assertEquals(phyPlan.size(), 3); -POLoad load = (POLoad)phyPlan.getRoots().get(0); assertEquals(phyPlan.getLeaves().get(0).getClass(), POStore.class); POForEach foreach = (POForEach)phyPlan.getSuccessors(phyPlan.getRoots().get(0)).get(0); @@ -476,13 +497,13 @@ assertEquals(inner.size(), 1); POProject prj = (POProject)inner.getRoots().get(0); assertEquals(prj.getColumn(), 0); -assertEquals(prj.getInputs().get(0), load); - +