Re: [VOTE] Release Pig 0.3.0 (candidate 0)
Downloaded, ran, ran tutorial, built piggybank. All looks good. +1

Alan.

On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote:

Hi,

I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows computation to be shared across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as a result of this optimization.

I ran the rat report and made sure that all the source files contain proper headers. (Not attaching the report since it caused trouble with the last release.)

Keys used to sign the release candidate are at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS.

Please download and try the release candidate: http://people.apache.org/~olga/pig-0.3.0-candidate-0/.

Please vote by Wednesday, June 24th.

Olga
[jira] Created: (PIG-859) Optimizer throw error on self-joins
Optimizer throw error on self-joins
---
Key: PIG-859
URL: https://issues.apache.org/jira/browse/PIG-859
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Fix For: 0.4.0

Doing self-join results in exception thrown by Optimizer. Consider the following query
{code}
grunt> A = load 'a';
grunt> B = Join A by $0, A by $0;
grunt> explain B;
2009-06-20 15:51:38,303 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1094: Attempt to insert between two nodes that were not connected.
Details at logfile: pig_1245538027026.log
{code}
Relevant stack-trace from log-file:
{code}
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2047: Internal error. Unable to introduce split operators.
        at org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(ImplicitSplitInserter.java:163)
        at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:163)
        at org.apache.pig.PigServer.compileLp(PigServer.java:844)
        at org.apache.pig.PigServer.compileLp(PigServer.java:781)
        at org.apache.pig.PigServer.getStorePlan(PigServer.java:723)
        at org.apache.pig.PigServer.explain(PigServer.java:566)
        ... 8 more
Caused by: org.apache.pig.impl.plan.PlanException: ERROR 1094: Attempt to insert between two nodes that were not connected.
        at org.apache.pig.impl.plan.OperatorPlan.doInsertBetween(OperatorPlan.java:500)
        at org.apache.pig.impl.plan.OperatorPlan.insertBetween(OperatorPlan.java:480)
        at org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(ImplicitSplitInserter.java:139)
        ... 13 more
{code}
A possible workaround is:
{code}
grunt> A = load 'a';
grunt> B = load 'a';
grunt> C = join A by $0, B by $0;
grunt> explain C;
{code}
--
This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
[ https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722701#action_12722701 ]

Santhosh Srinivasan commented on PIG-773:
-
Comments:

1. Minor comment - the comments on the empty productions all have the same text. For tuple and bag, the text should be changed to refer to tuple and bag respectively.
{code}
+ |{ } // Match the empty content in map.
{code}
2. I am not sure about the test case testEmptyBagConstRecursive. Here the bag contains an empty tuple. As a result, the field schema for the bag should contain the schema of the empty tuple. The test case will probably fail.
{code}
+@Test
+public void testEmptyBagConstRecursive() throws FrontendException{
+
+LogicalPlan lp = buildPlan("a = foreach (load 'b') generate {()};");
+LOForEach foreach = (LOForEach) lp.getLeaves().get(0);
+
+Schema.FieldSchema bagFs = new Schema.FieldSchema(null,null,DataType.BAG);
+Schema expectedSchema = new Schema(bagFs);
+
+assertTrue(Schema.equals(foreach.getSchema(), expectedSchema, false, true));
+}
{code}
3. There are no tests that check if the empty constants are actually created, i.e., there are no checks for expected empty constants. The test below only checks that the parser can parse the new syntax for empty constants. In addition, the values generated by the parser have to be checked against the expected values for these constants.
{code}
+@Test
+public void testRandomEmptyConst(){
+// Various random scripts to test recursive nature of parser with empty constants.
+
+buildPlan("a = foreach (load 'b') generate {({})};");
+buildPlan("a = foreach (load 'b') generate ({()});");
+buildPlan("a = foreach (load 'b') generate {(),()};");
+buildPlan("a = foreach (load 'b') generate ({},{});");
+buildPlan("a = foreach (load 'b') generate ((),());");
+buildPlan("a = foreach (load 'b') generate ([],[]);");
+buildPlan("a = foreach (load 'b') generate {({},{})};");
+buildPlan("a = foreach (load 'b') generate {([],[])};");
+buildPlan("a = foreach (load 'b') generate (({},{}));");
+buildPlan("a = foreach (load 'b') generate (([],[]));");
+}
{code}

Empty complex constants (empty bag, empty tuple and empty map) should be supported
--
Key: PIG-773
URL: https://issues.apache.org/jira/browse/PIG-773
Project: Pig
Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Priority: Minor
Attachments: pig-773.patch, pig-773_v2.patch

We should be able to create an empty bag constant using {}, an empty tuple constant using (), and an empty map constant using [] within a pig script.
[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722706#action_12722706 ]

Ashutosh Chauhan commented on PIG-820:
--
In the patch, RandomSampleLoader is marked as Serializable and the loader field in it is marked as transient. Since loader is initialized in the constructor and is used later on, findbugs is complaining:

This class contains a field that is updated at multiple places in the class, thus it seems to be part of the state of the class. However, since the field is marked as transient and not set in readObject or readResolve, it will contain the default value in any deserialized instance of the class.

However, there is no need for RandomSampleLoader to implement Serializable anyway (and thus for loader to be marked as transient), because loader is reconstructed from FuncSpec later on. For the same reason, neither PigStorage nor BinStorage implements Serializable. Will be submitting a new patch with the required changes.

PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
-
Key: PIG-820
URL: https://issues.apache.org/jira/browse/PIG-820
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
Attachments: pig-820.patch

Currently a sampling job requires that data already be stored in BinaryStorage format, since RandomSampleLoader extends BinaryStorage. For order by this has mostly been acceptable, because users tend to use order by at the end of their script where other MR jobs have already operated on the data and thus it is already being stored in BinaryStorage. For pig scripts that just did an order by, an entire MR job is required to read the data and write it out in BinaryStorage format.

As we begin work on join algorithms that will require sampling, this requirement to read the entire input and write it back out will not be acceptable. Join is often the first operation of a script, and thus is much more likely to trigger this useless up front translation job.

Instead RandomSampleLoader can be changed to subsume an existing loader, using the user specified loader to read the tuples while handling the skipping between tuples itself. This will require the subsumed loader to implement a Samplable interface, that will look something like:
{code}
public interface SamplableLoader extends LoadFunc {

    /**
     * Skip ahead in the input stream.
     * @param n number of bytes to skip
     * @return number of bytes actually skipped. The return semantics are
     * exactly the same as {@link java.io.InputStream#skip(long)}
     */
    public long skip(long n) throws IOException;

    /**
     * Get the current position in the stream.
     * @return position in the stream.
     */
    public long getPosition() throws IOException;
}
{code}
The MRCompiler would then check if the loader being used to load data implemented the SamplableLoader interface. If so, rather than create an initial MR job to do the translation it would create the sampling job, having RandomSampleLoader use the user specified loader.
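The read-then-skip loop described in this issue can be sketched in isolation. This is only an illustration under stated assumptions, not code from any of the attached patches: `SkippableReader`, `LineReader` and `SkipSampler` are hypothetical names, the reader works on newline-delimited ASCII records instead of Pig tuples, and the fixed byte gap stands in for whatever skipping policy RandomSampleLoader really applies.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed interface; it does not extend Pig's LoadFunc. */
interface SkippableReader {
    /** Skips up to n bytes; returns the number actually skipped, like InputStream.skip. */
    long skip(long n) throws IOException;

    /** Returns the current byte position in the stream. */
    long getPosition() throws IOException;

    /** Returns the next record, or null at end of input. */
    String next() throws IOException;
}

/** Toy reader over newline-delimited ASCII records (an assumption of this sketch). */
final class LineReader implements SkippableReader {
    private final InputStream in;
    private long pos = 0;

    LineReader(byte[] data) {
        this.in = new ByteArrayInputStream(data);
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        pos += skipped;
        return skipped;
    }

    @Override
    public long getPosition() {
        return pos;
    }

    @Override
    public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (b == '\n') {
                return sb.toString();
            }
            sb.append((char) b);
        }
        // End of input: report a trailing unterminated record, otherwise null.
        return sb.length() == 0 ? null : sb.toString();
    }
}

/** The sampler owns the skipping; the wrapped reader only parses records. */
final class SkipSampler {
    static List<String> sample(SkippableReader reader, long gap) throws IOException {
        List<String> picked = new ArrayList<>();
        String record;
        while ((record = reader.next()) != null) {
            picked.add(record);
            if (reader.skip(gap) > 0) {
                // The skip may land mid-record, so drop the remainder of that record.
                reader.next();
            }
        }
        return picked;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "r0\nr1\nr2\nr3\nr4\nr5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sample(new LineReader(data), 4)); // prints [r0, r3]
    }
}
```

The point of the split is that the wrapped reader needs no knowledge of sampling: it only has to report how far it actually skipped and where it is, which is what the two proposed methods provide.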
Re: [VOTE] Release Pig 0.3.0 (candidate 0)
On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote:

Hi, I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows computation to be shared across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as a result of this optimization.

+1

I downloaded the release, validated checksums and ran the unit-tests successfully.

Arun
Re: [VOTE] Release Pig 0.3.0 (candidate 0)
+1.
RE: [VOTE] Release Pig 0.3.0 (candidate 0)
+1 for release. -Pradeep -Original Message- From: Alan Gates [mailto:ga...@yahoo-inc.com] Sent: Monday, June 22, 2009 9:30 AM To: priv...@hadoop.apache.org Cc: pig-dev@hadoop.apache.org; gene...@hadoop.apache.org Subject: Re: [VOTE] Release Pig 0.3.0 (candidate 0) Downloaded, ran, ran tutorial, built piggybank. All looks good. +1 Alan.
requirements for Pig 1.0?
I know there was some discussion of making the types release (0.2) a Pig 1 release, but that got nixed. There wasn't a similar discussion on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed since?
[jira] Resolved: (PIG-600) PiggyBank compilation instructions don't work
[ https://issues.apache.org/jira/browse/PIG-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-600.
Resolution: Fixed

Thanks, Dmitry, for your help with this issue!

PiggyBank compilation instructions don't work
-
Key: PIG-600
URL: https://issues.apache.org/jira/browse/PIG-600
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Reporter: David Ciemiewicz

I know that PiggyBank is as-is, but the instructions are incomplete; they should include all the steps required to compile PiggyBank.

http://wiki.apache.org/pig/PiggyBank

I checked out the types branch version of PiggyBank by modifying the instructions to check out:

svn co http://svn.apache.org/repos/asf/hadoop/pig/branches/types/contrib/piggybank/

At step 2 it says:

To build a jar file that contains all available user defined functions (UDFs), please follow the steps:
1. Checkout UDF code: svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank
2. Build the jar file: from trunk/contrib/piggybank/java directory run ant. This will generate piggybank.jar in the same directory.

So I went into the piggybank/java directory, ran ant, and got the following errors:
{code}
-bash-3.00$ ant
Buildfile: build.xml

init:

compile:
[echo] *** Compiling Pig UDFs ***
[javac] Compiling 70 source files to /homes/ciemo/piggybank/java/build/classes
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:25: cannot find symbol
[javac] symbol : class EvalFunc
[javac] location: package org.apache.pig
[javac] import org.apache.pig.EvalFunc;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:26: cannot find symbol
[javac] symbol : class FuncSpec
[javac] location: package org.apache.pig
[javac] import org.apache.pig.FuncSpec;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:27: package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.Tuple;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:28: package org.apache.pig.impl.logicalLayer.schema does not exist
[javac] import org.apache.pig.impl.logicalLayer.schema.Schema;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:29: package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.DataType;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:30: package org.apache.pig.impl.logicalLayer does not exist
[javac] import org.apache.pig.impl.logicalLayer.FrontendException;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:31: package org.apache.pig.impl.util does not exist
[javac] import org.apache.pig.impl.util.WrappedIOException;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:61: cannot find symbol
[javac] symbol: class EvalFunc
[javac] public class ABS extends EvalFunc<Double>{
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:67: cannot find symbol
[javac] symbol : class Tuple
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Double exec(Tuple input) throws IOException {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
[javac] symbol : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
[javac] symbol : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:93: cannot find symbol
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-851:
Status: In Progress (was: Patch Available)

Map type used as return type in UDFs not recognized at all times
Key: PIG-851
URL: https://issues.apache.org/jira/browse/PIG-851
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Attachments: patch_815.txt

When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run time failure. An example script and UDF follow:
{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {

    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    //Note that the outputSchema method is commented out
    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
}
{code}
{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
2009-06-15 17:59:01,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
{code}
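The issue does not say where Pig's type inference goes wrong, so the following is not a description of the bug or of the attached patch. It is only a small sketch of the general technique involved: reading the declared return type of a generic function class through reflection, including the case where that type is itself parameterized (as `Map<Object, Object>` is). `Func`, `MapFunc`, `IntFunc` and `ReturnTypeProbe` are hypothetical names invented for this example and are unrelated to Pig's `EvalFunc` and `DataType`.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical generic function base class, standing in for a UDF base type. */
abstract class Func<T> {
    abstract T exec(Object input);
}

/** A function whose declared return type is a parameterized Map. */
final class MapFunc extends Func<Map<Object, Object>> {
    @Override
    Map<Object, Object> exec(Object input) {
        return new HashMap<>();
    }
}

/** A function whose declared return type is a plain class. */
final class IntFunc extends Func<Integer> {
    @Override
    Integer exec(Object input) {
        return 1;
    }
}

final class ReturnTypeProbe {
    enum Tag { MAP, INTEGER, UNKNOWN }

    /** Infers a coarse type tag from the type argument given to the direct superclass. */
    static Tag infer(Class<?> funcClass) {
        Type superType = funcClass.getGenericSuperclass();
        if (!(superType instanceof ParameterizedType)) {
            return Tag.UNKNOWN;
        }
        Type arg = ((ParameterizedType) superType).getActualTypeArguments()[0];
        // Map<Object, Object> shows up as a ParameterizedType, not a Class,
        // so unwrap it to its raw type before comparing.
        if (arg instanceof ParameterizedType) {
            arg = ((ParameterizedType) arg).getRawType();
        }
        if (!(arg instanceof Class)) {
            return Tag.UNKNOWN;
        }
        Class<?> raw = (Class<?>) arg;
        if (Map.class.isAssignableFrom(raw)) {
            return Tag.MAP;
        }
        if (raw == Integer.class) {
            return Tag.INTEGER;
        }
        return Tag.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(infer(MapFunc.class)); // prints MAP
        System.out.println(infer(IntFunc.class)); // prints INTEGER
    }
}
```

A lookup of this kind that only handled plain `Class` type arguments would report `UNKNOWN` for `MapFunc`, which is the sort of gap an explicit outputSchema override works around.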
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-851: Patch Info: [Patch Available] Map type used as return type in UDFs not recognized at all times Key: PIG-851 URL: https://issues.apache.org/jira/browse/PIG-851 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Attachments: patch_815.txt
[jira] Created: (PIG-860) Load split by 'file' - not documented in pig latin reference manual
Load split by 'file' - not documented in pig latin reference manual
---
Key: PIG-860
URL: https://issues.apache.org/jira/browse/PIG-860
Project: Pig
Issue Type: Task
Components: documentation
Reporter: Thejas M Nair
Priority: Minor

split by 'file' is not documented in the Pig Latin Reference Manual (http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html). The option is described at http://wiki.apache.org/pig/PigStreamingFunctionalSpec (section 4.3).
[jira] Issue Comment Edited: (PIG-860) Load split by 'file' - not documented in pig latin reference manual
[ https://issues.apache.org/jira/browse/PIG-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722760#action_12722760 ]

Thejas M Nair edited comment on PIG-860 at 6/22/09 11:56 AM:
-
Use case, from discussion in user mailing list -

-- Forwarded Message
From: Pradeep Kamath prade...@yah...c.com
Reply-To: pig-u...@hadoop.apache.org pig-u...@hadoop.apache.org
Date: Mon, 22 Jun 2009 11:04:04 -0700
To: pig-u...@hadoop.apache.org
Conversation: How to make the pig unsplittable ?
Subject: RE: How to make the pig unsplittable ?

You can load using the following syntax:
A = load 'inputfile' split by 'file';
The split by 'file' will ensure issplittable is set to false and the input is not split.
-Pradeep

-Original Message-
From: zhang jianfeng [mailto:zjff..ail.com]
Sent: Sunday, June 21, 2009 10:48 PM
To: pig-u...@hadoop.apache.org
Subject: How to make the pig unsplittable ?

Hi all,
Because of my input file format, the first line of the file is the definition of each field, followed by lines of records. So I did not find a good way of using a custom slicer. I'd like to make pig not split my file, but I did not find an easy way. Now I have to change the code in POLoad and LOLoad to make the variable isSplitable false. Is there any easier way to make it unsplittable, such as configuration?
Thank you for any help.
Jeff Zhang
-- End of Forwarded Message

Load split by 'file' - not documented in pig latin reference manual
---
Key: PIG-860
URL: https://issues.apache.org/jira/browse/PIG-860
Project: Pig
Issue Type: Task
Components: documentation
Reporter: Thejas M Nair
Priority: Minor

split by 'file' is not documented in the Pig Latin Reference Manual (http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html). The option is described at http://wiki.apache.org/pig/PigStreamingFunctionalSpec (section 4.3).
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-820: - Attachment: pig-820_v2.patch Patch which fixes findbugs warning. PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader - Key: PIG-820 URL: https://issues.apache.org/jira/browse/PIG-820 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0, 0.4.0 Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.4.0 Attachments: pig-820.patch, pig-820_v2.patch
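The findbugs warning behind this patch is about a transient field in a Serializable class: the field is set in the constructor, but nothing restores it after deserialization. The effect is easy to reproduce outside Pig. The sketch below uses a hypothetical `Sampler` class with a `StringBuilder` standing in for the wrapped loader; it is not RandomSampleLoader and makes no claim about how the patch resolves the warning.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical class: serializable state plus a transient helper set only in the constructor. */
final class Sampler implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int rate;
    // Stands in for the wrapped loader; transient fields are not written to the stream.
    private transient StringBuilder delegate;

    Sampler(int rate) {
        this.rate = rate;
        this.delegate = new StringBuilder("loader");
    }

    int rate() {
        return rate;
    }

    boolean hasDelegate() {
        return delegate != null;
    }
}

final class TransientFieldDemo {
    /** Serializes the object to a byte array and reads it back. */
    static Sampler roundTrip(Sampler original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (Sampler) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Sampler copy = roundTrip(new Sampler(10));
        // The constructor is not run on deserialization, so the transient field stays null.
        System.out.println(copy.rate() + " " + copy.hasDelegate()); // prints 10 false
    }
}
```

Either the field has to be rebuilt in readObject or readResolve, or, as the earlier comment on this issue argues, the class does not need to be Serializable at all when the instance is reconstructed by other means.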
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-820: - Status: Patch Available (was: Open) Submitting to hudson PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader - Key: PIG-820 URL: https://issues.apache.org/jira/browse/PIG-820 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0, 0.4.0 Reporter: Alan Gates Assignee: Ashutosh Chauhan Fix For: 0.4.0 Attachments: pig-820.patch, pig-820_v2.patch
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-820: - Fix Version/s: 0.4.0 Assignee: Ashutosh Chauhan (was: Alan Gates) Status: Open (was: Patch Available) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader - Key: PIG-820 URL: https://issues.apache.org/jira/browse/PIG-820 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0, 0.4.0 Reporter: Alan Gates Assignee: Ashutosh Chauhan Fix For: 0.4.0 Attachments: pig-820.patch, pig-820_v2.patch
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-820: - Attachment: pig-820_v3.patch PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader - Key: PIG-820 URL: https://issues.apache.org/jira/browse/PIG-820 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0, 0.4.0 Reporter: Alan Gates Assignee: Ashutosh Chauhan Fix For: 0.4.0 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
Re: [VOTE] Release Pig 0.3.0 (candidate 0)
Unit tests pass on Windows. +1 for release.

Daniel

- Original Message -
From: Santhosh Srinivasan s...@yahoo-inc.com
To: pig-dev@hadoop.apache.org
Sent: Monday, June 22, 2009 2:15 PM
Subject: RE: [VOTE] Release Pig 0.3.0 (candidate 0)

I was able to download the archive, verify the checksum and run the unit test cases successfully. However, I was not able to run the tutorial. One of the following has to be fixed:
1. The release notes should not reference the wiki for the tutorial.
2. The wiki for the tutorial has to be updated to allow users to run the tutorial successfully.
Pending the fix for the aforementioned problems, +1 for the release.

Thanks,
Santhosh