Re: requirements for Pig 1.0?
For 1.0 - complete Owl? http://wiki.apache.org/pig/Metadata

Russell Jurney
rjur...@cloudstenography.com

On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:

> I don't believe there's a solid list of want-to-haves for 1.0. The big issue I see is that there are too many interfaces that are still shifting, such as:
>
> 1) Data input/output formats. The way we do slicing (that is, user-provided InputFormats) and the equivalent outputs aren't yet solid. They are still too tied to load and store functions. We need to break those out and understand how they will be expressed in the language. Related to this is the semantics of how Pig interacts with non-file-based inputs and outputs. We have a suggestion of moving to URLs, but we haven't finished test-driving this to see if it will really be what we want.
>
> 2) The memory model. While technically the choices we make on how to represent things in memory are internal, the reality is that these changes may affect the way we read and write tuples and bags, which in turn may affect our load, store, eval, and filter functions.
>
> 3) SQL. We're working on introducing SQL soon, and it will take a few releases to be fully baked.
>
> 4) Much better error messages. In 0.2 our error messages made a leap forward, but before we can claim to be 1.0 I think they need to make two more leaps: first, they need to be written in a way end users can understand, rather than in a way engineers can understand, including having sufficient error documentation with suggested courses of action; second, they need to be much better at tying errors back to where they happened in the script. Right now, if one of the MR jobs associated with a Pig Latin script fails, there is no way to know what part of the script it is associated with.
>
> There are probably others, but those are the ones I can think of off the top of my head. The summary from my viewpoint is that we still have several 0.x releases before we're ready to consider 1.0. It would be nice to be 1.0 not too long after Hadoop is, which still gives us at least 6-9 months.
>
> Alan.
>
> On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:
>
>> I know there was some discussion of making the types release (0.2) a Pig 1.0 release, but that got nixed. There wasn't a similar discussion on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed since?
[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723508#action_12723508 ]

Hadoop QA commented on PIG-851:
-------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12411309/patch_815.txt
against trunk revision 787908.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 7 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/console

This message is automatically generated.

Map type used as return type in UDFs not recognized at all times
----------------------------------------------------------------
Key: PIG-851
URL: https://issues.apache.org/jira/browse/PIG-851
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
Fix For: 0.4.0
Attachments: patch_815.txt

When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run-time failure.
An example script and UDF follow:

{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    // Note that the outputSchema method is commented out
    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
}
{code}

{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
2009-06-15 17:59:01,776 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
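As a sketch of what automatic detection could look like when outputSchema is not overridden (hypothetical code, not Pig's actual frontend — the class and method names below are stand-ins), reflecting on the return type of exec is enough to recognize a Map even after generic-type erasure:

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class ReturnTypeProbe {
    // Stand-in for a UDF whose exec() returns a map; Pig's real EvalFunc
    // hierarchy is not used here so the sketch stays self-contained.
    static class MapUDF {
        public Map<Object, Object> exec(Object input) {
            return new HashMap<Object, Object>();
        }
    }

    // Inspect exec()'s declared return type and name a Pig-like type for it.
    // Map<K,V> erases to the raw Map interface, which is still detectable.
    static String inferType(Class<?> udf) {
        try {
            Method exec = udf.getMethod("exec", Object.class);
            Class<?> ret = exec.getReturnType();
            if (Map.class.isAssignableFrom(ret)) return "MAP";
            if (CharSequence.class.isAssignableFrom(ret)) return "CHARARRAY";
            return "UNKNOWN";
        } catch (NoSuchMethodException e) {
            return "UNKNOWN";
        }
    }

    public static void main(String[] args) {
        System.out.println(inferType(MapUDF.class)); // prints MAP
    }
}
```

The point is that the frontend never needs the UDF author's cooperation to learn that exec returns a map; the reported failure is about Pig not performing this inference, not about the information being unavailable.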
Build failed in Hudson: Pig-Patch-minerva.apache.org #98
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/changes

Changes:

[daijy] PIG-832: Make import list configurable

------------------------------------------
[...truncated 94726 lines...]
     [exec]     [junit] 09/06/24 09:14:53 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
     [exec]     [junit] 09/06/24 09:14:53 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
     [exec]     [junit] 09/06/24 09:14:53 INFO dfs.DataNode: Deleting block blk_-8813130154407314932_1005 file dfs/data/data2/current/blk_-8813130154407314932
     [exec]     [junit] 09/06/24 09:14:53 INFO dfs.DataNode: Deleting block blk_-6716071823453029809_1004 file dfs/data/data1/current/blk_-6716071823453029809
     [exec]     [junit] 09/06/24 09:14:53 INFO dfs.DataNode: Deleting block blk_602371418698131255_1006 file dfs/data/data1/current/blk_602371418698131255
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Deleting block blk_-8813130154407314932_1005 file dfs/data/data8/current/blk_-8813130154407314932
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Deleting block blk_602371418698131255_1006 file dfs/data/data7/current/blk_602371418698131255
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* ask 127.0.0.1:53034 to delete blk_-6716071823453029809_1004 blk_602371418698131255_1006 blk_-8813130154407314932_1005
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* ask 127.0.0.1:39667 to delete blk_-6716071823453029809_1004
     [exec]     [junit] 09/06/24 09:14:54 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
     [exec]     [junit] 09/06/24 09:14:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.jar. blk_1281412583459416781_1012
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block blk_1281412583459416781_1012 src: /127.0.0.1:60705 dest: /127.0.0.1:40234
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block blk_1281412583459416781_1012 src: /127.0.0.1:42301 dest: /127.0.0.1:53034
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block blk_1281412583459416781_1012 src: /127.0.0.1:48341 dest: /127.0.0.1:39667
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block blk_1281412583459416781_1012 of size 1428031 from /127.0.0.1
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:39667 is added to blk_1281412583459416781_1012 size 1428031
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: PacketResponder 0 for block blk_1281412583459416781_1012 terminating
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block blk_1281412583459416781_1012 of size 1428031 from /127.0.0.1
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: PacketResponder 1 for block blk_1281412583459416781_1012 terminating
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:53034 is added to blk_1281412583459416781_1012 size 1428031
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block blk_1281412583459416781_1012 of size 1428031 from /127.0.0.1
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: PacketResponder 2 for block blk_1281412583459416781_1012 terminating
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:40234 is added to blk_1281412583459416781_1012 size 1428031
     [exec]     [junit] 09/06/24 09:14:54 INFO fs.FSNamesystem: Increasing replication for file /tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.jar. New replication is 2
     [exec]     [junit] 09/06/24 09:14:54 INFO fs.FSNamesystem: Reducing replication for file /tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.jar. New replication is 2
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.split. blk_-1411835332935289445_1013
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block blk_-1411835332935289445_1013 src: /127.0.0.1:33410 dest: /127.0.0.1:41639
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block blk_-1411835332935289445_1013 src: /127.0.0.1:60709 dest: /127.0.0.1:40234
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block blk_-1411835332935289445_1013 src: /127.0.0.1:48344 dest: /127.0.0.1:39667
     [exec]     [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block
Build failed in Hudson: Pig-Patch-minerva.apache.org #99
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/99/

------------------------------------------
started
Building remotely on minerva.apache.org (Ubuntu)
Updating http://svn.apache.org/repos/asf/hadoop/pig/trunk
Fetching 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' at -1 into 'http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/trunk/test/bin'
At revision 787980
At revision 787980
no change for http://svn.apache.org/repos/asf/hadoop/pig/trunk since the previous build
no change for http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch since the previous build
[Pig-Patch-minerva.apache.org] $ /bin/bash /tmp/hudson8576193962376397285.sh
/home/hudson/tools/java/latest1.6/bin/java
Buildfile: build.xml

check-for-findbugs:
findbugs.check:
java5.check:
forrest.check:
hudson-test-patch:
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]     Testing patch for PIG-862.
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]
     [exec] Reverted 'src/org/apache/pig/data/DataType.java'
     [exec]
     [exec] Fetching external item into 'test/bin'
     [exec] A    test/bin/test-patch.sh
     [exec] Updated external to revision 787980.
     [exec]
     [exec] Updated to revision 787980.
     [exec] PIG-862 patch is being downloaded at Wed Jun 24 11:24:30 UTC 2009 from
     [exec] http://issues.apache.org/jira/secure/attachment/12411560/PIG-862.patch
     [exec]
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]     Pre-building trunk to determine trunk number
     [exec]     of release audit, javac, and Findbugs warnings.
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]
     [exec] /home/hudson/tools/ant/latest/bin/ant -Djava5.home=/home/hudson/tools/java/latest1.5 -Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= releaseaudit > http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/patchprocess/trunkReleaseAuditWarnings.txt 2>&1
     [exec] /home/hudson/tools/ant/latest/bin/ant -Djavac.args="-Xlint -Xmaxwarns 1000" -Declipse.home=/home/nigel/tools/eclipse/latest -Djava5.home=/home/hudson/tools/java/latest1.5 -Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= clean tar > http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/patchprocess/trunkJavacWarnings.txt 2>&1
     [exec] /home/hudson/tools/ant/latest/bin/ant -Dfindbugs.home=/home/nigel/tools/findbugs/latest -Djava5.home=/home/hudson/tools/java/latest1.5 -Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= findbugs > /dev/null 2>&1
     [exec]
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]     Checking there are no @author tags in the patch.
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]
     [exec] There appear to be 0 @author tags in the patch.
     [exec]
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]     Checking there are new or changed tests in the patch.
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]
     [exec] There appear to be 0 test files referenced in the patch.
     [exec] The patch appears to be a documentation patch that doesn't require tests.
     [exec]
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]     Applying patch.
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]
     [exec] (Stripping trailing CRs from patch.)
     [exec] can't find file to patch at input line 5
     [exec] Perhaps you used the wrong -p or --strip option?
     [exec] The text leading up to this was:
     [exec] --------------------------
     [exec]
[jira] Commented: (PIG-862) Pig Site - 0.3.0 updates
[ https://issues.apache.org/jira/browse/PIG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723528#action_12723528 ]

Hadoop QA commented on PIG-862:
-------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12411560/PIG-862.patch
against trunk revision 787908.

+1 @author. The patch does not contain any @author tags.
+0 tests included. The patch appears to be a documentation patch that doesn't require tests.
-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/99/console

This message is automatically generated.

Pig Site - 0.3.0 updates
------------------------
Key: PIG-862
URL: https://issues.apache.org/jira/browse/PIG-862
Project: Pig
Issue Type: Task
Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel
Attachments: PIG-862.patch

Updates for Pig Site:
- change home tab to project tab
- added search bar
- cleaned up logo image

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723554#action_12723554 ]

Hudson commented on PIG-832:
----------------------------

Integrated in Pig-trunk #484 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/484/]): Make import list configurable

Make import list configurable
-----------------------------
Key: PIG-832
URL: https://issues.apache.org/jira/browse/PIG-832
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
Fix For: 0.4.0
Attachments: PIG-832-1.patch, PIG-832-2.patch

Currently, it is hardwired in PigContext.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723555#action_12723555 ]

Hudson commented on PIG-734:
----------------------------

Integrated in Pig-trunk #484 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/484/]): Changed maps to only take strings as keys.

Non-string keys in maps
-----------------------
Key: PIG-734
URL: https://issues.apache.org/jira/browse/PIG-734
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
Fix For: 0.4.0
Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch

With the addition of types to Pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-Java UDFs (since many scripting languages include associative arrays, such as Perl's hash). So I propose we scope maps back to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
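The serialization win described in PIG-734 can be sketched in a few lines. This is an assumed wire format for illustration, not Pig's actual one: with string-only keys, each key is written directly with writeUTF and no per-key type tag is needed, which is exactly the per-entry check the issue wants to avoid.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class StringKeyMapDemo {
    // Serialize a map whose keys are known to be strings: entry count,
    // then (key, value) pairs with no type tag on the keys.
    static byte[] write(Map<String, Integer> m) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(m.size());
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                out.writeUTF(e.getKey());   // no tag: keys are always chararray
                out.writeInt(e.getValue());
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize by reading keys back with readUTF, again with no tag check.
    static Map<String, Integer> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            Map<String, Integer> m = new LinkedHashMap<>();
            for (int i = 0; i < n; i++) m.put(in.readUTF(), in.readInt());
            return m;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("a", 1);
        m.put("b", 2);
        System.out.println(read(write(m))); // prints {a=1, b=2}
    }
}
```

With Object keys, by contrast, every entry would need a leading type byte and a switch over the possible key types on both the write and read paths.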
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-851:
---------------------------

Status: Open (was: Patch Available)

Map type used as return type in UDFs not recognized at all times
----------------------------------------------------------------
Key: PIG-851
URL: https://issues.apache.org/jira/browse/PIG-851
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
Fix For: 0.4.0
Attachments: patch_815.txt

When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run-time failure.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-851:
---------------------------

Attachment: (was: patch_815.txt)

Map type used as return type in UDFs not recognized at all times
----------------------------------------------------------------
Key: PIG-851
URL: https://issues.apache.org/jira/browse/PIG-851
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
Fix For: 0.4.0

When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run-time failure.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-851:
---------------------------

Status: Patch Available (was: Open)

Map type used as return type in UDFs not recognized at all times
----------------------------------------------------------------
Key: PIG-851
URL: https://issues.apache.org/jira/browse/PIG-851
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
Fix For: 0.4.0
Attachments: Pig_851_patch.txt

When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run-time failure.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: requirements for Pig 1.0?
Integration with Owl is something we want for 1.0. I am hopeful that by Pig's 1.0, Owl will have flown the coop and become either a subproject or found a home in Hadoop's common, since it will hopefully be used by multiple other subprojects.

Alan.

On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote:

> For 1.0 - complete Owl? http://wiki.apache.org/pig/Metadata
>
> Russell Jurney
> rjur...@cloudstenography.com
Re: requirements for Pig 1.0?
Alan, any thoughts on performance baselines and benchmarks?

I am a little surprised that you think SQL is a requirement for 1.0, since it's essentially an overlay, not core functionality.

What about the storage layer rewrite (or is that what you referred to with your first bullet point)?

Also, the subject of making more (or all) operators nestable within a foreach comes up now and then... would you consider this important for 1.0, or something that can wait?

Integration with other languages (a la PyPig)?

The Roadmap on the wiki is still "as of Q3 2007", which makes it hard for an outside contributor to know where to jump :-).

-D

On Wed, Jun 24, 2009 at 10:02 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Integration with Owl is something we want for 1.0. I am hopeful that by Pig's 1.0, Owl will have flown the coop and become either a subproject or found a home in Hadoop's common, since it will hopefully be used by multiple other subprojects.
>
> Alan.
Re: requirements for Pig 1.0?
To be clear, going to 1.0 is not about having a certain set of features. It is about stability and usability. When a project declares itself 1.0, it is making some guarantees regarding the stability of its interfaces (in Pig's case this is Pig Latin, UDFs, and command line usage). It is also declaring itself ready for the world at large, not just the brave and the free. New features can come in as experimental once we're 1.0, but the semantics of the language and UDFs can't be shifting (as we've done the last several releases and will continue to do for a bit, I think). With that in mind, further comments inlined.

On Jun 24, 2009, at 10:18 AM, Dmitriy Ryaboy wrote:

> Alan, any thoughts on performance baselines and benchmarks?

Meaning do we need to reach a certain speed before 1.0? I don't think so. Pig is fast enough now that many people find it useful. We want to continue working to shrink the gap between Pig and MR, but I don't see this as a blocker for 1.0.

> I am a little surprised that you think SQL is a requirement for 1.0, since it's essentially an overlay, not core functionality.

If we were debating today whether to go 1.0, I agree that we would not wait for SQL. But given that we aren't (at least I wouldn't vote for it now) and that SQL will be in soon, it will need to stabilize.

> What about the storage layer rewrite (or is that what you referred to with your first bullet point)?

To be clear, Zebra (the columnar store stuff) is not a rewrite of the storage layer. It is an additional storage option we want to support. We aren't changing current support for load and store.

> Also, the subject of making more (or all) operators nestable within a foreach comes up now and then... would you consider this important for 1.0, or something that can wait?

This would be an added feature, not a semantic change in Pig Latin.

> Integration with other languages (a la PyPig)?

Again, this is a new feature, not a stability issue.

> The Roadmap on the wiki is still "as of Q3 2007", which makes it hard for an outside contributor to know where to jump :-).

Agreed. Olga has given me the task of updating this soon. I'm going to try to get to that over the next couple of weeks. This discussion will certainly provide input to that update.

Alan.
[jira] Created: (PIG-863) Function (UDF) automatic namespace resolution is really needed
Function (UDF) automatic namespace resolution is really needed
--------------------------------------------------------------
Key: PIG-863
URL: https://issues.apache.org/jira/browse/PIG-863
Project: Pig
Issue Type: Improvement
Reporter: David Ciemiewicz

The Apache PiggyBank documentation says that to reference a function, I need to specify it as org.apache.pig.piggybank.evaluation.string.UPPER(text), as in the example:

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
{code}

Why can't we implement automatic namespace resolution so that we can just reference UPPER without namespace qualifiers?

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
TweetsInaug = FILTER Tweets BY UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
{code}

I know about the workaround:

{code}
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
{code}

But this is really a pain to do if I have lots of functions. Just warn if there is a collision and suggest I use the define workaround in the warning messages.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
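The resolution this issue asks for (and which the configurable import list in PIG-832 implies) amounts to trying each package prefix on a search path until a class loads. A self-contained sketch, resolving JDK classes instead of real UDFs; the class name and SEARCH_PATH below are illustrative stand-ins for Pig's configured import list, not its actual code:

```java
import java.util.Arrays;
import java.util.List;

public class UdfResolver {
    // Hypothetical import list: the empty prefix handles fully qualified
    // names, then each package prefix is tried in order.
    static final List<String> SEARCH_PATH =
            Arrays.asList("", "java.lang.", "java.util.");

    // Resolve a short class name by prepending each prefix until one loads.
    static Class<?> resolve(String shortName) {
        for (String prefix : SEARCH_PATH) {
            try {
                return Class.forName(prefix + shortName);
            } catch (ClassNotFoundException e) {
                // not under this prefix; try the next one
            }
        }
        throw new IllegalArgumentException("cannot resolve " + shortName);
    }

    public static void main(String[] args) {
        System.out.println(resolve("ArrayList").getName()); // java.util.ArrayList
    }
}
```

The collision warning the reporter asks for would fall out of the same loop: instead of returning on the first hit, collect all prefixes that resolve and warn when more than one does.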
[jira] Commented: (PIG-863) Function (UDF) automatic namespace resolution is really needed
[ https://issues.apache.org/jira/browse/PIG-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723683#action_12723683 ]

Dmitriy V. Ryaboy commented on PIG-863:
---------------------------------------

I believe PIG-832 addresses this.

Function (UDF) automatic namespace resolution is really needed
--------------------------------------------------------------
Key: PIG-863
URL: https://issues.apache.org/jira/browse/PIG-863
Project: Pig
Issue Type: Improvement
Reporter: David Ciemiewicz

The Apache PiggyBank documentation says that to reference a function, I need to specify it as org.apache.pig.piggybank.evaluation.string.UPPER(text). Why can't we implement automatic namespace resolution so that we can just reference UPPER without namespace qualifiers?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
RE: [VOTE] Release Pig 0.3.0 (candidate 0)
Hi, Thanks to everybody who voted! We have four +1 binding votes from PMC members Arun Murthy, Nigel Daley, Alan Gates, and Olga Natkovich. We have three +1 non-binding votes from Pig Committers Pradeep Kamath, Daniel Dai, and Santhosh Srinivasan. There are no -1 votes. Also, sufficient time has passed to make the release official. Unless I hear otherwise, I am going to start publishing the release shortly. Olga -Original Message- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Thursday, June 18, 2009 12:30 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org; gene...@hadoop.apache.org Subject: [VOTE] Release Pig 0.3.0 (candidate 0) Hi, I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows sharing computation across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as the result of this optimization. I ran the rat report and made sure that all the source files contain proper headers. (Not attaching the report since it caused trouble with the last release.) Keys used to sign the release candidate are at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS. Please download and try the release candidate: http://people.apache.org/~olga/pig-0.3.0-candidate-0/. Please vote by Wednesday, June 24th. Olga
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-660: --- Fix Version/s: (was: 0.1.0) 0.4.0 Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-391) There should be a DataAtom.setValue(DataAtom)
[ https://issues.apache.org/jira/browse/PIG-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-391: --- Resolution: Won't Fix Status: Resolved (was: Patch Available) This is no longer relevant in the current code There should be a DataAtom.setValue(DataAtom) - Key: PIG-391 URL: https://issues.apache.org/jira/browse/PIG-391 Project: Pig Issue Type: Improvement Affects Versions: 0.1.0 Reporter: Ted Dunning Fix For: 0.1.0 Attachments: setValue.patch setValue on a DataAtom can accept a string or integer or double, but not a DataAtom. That means I have to inject a string conversion or type test into my code when I write a UDF. Definitely not good. This should be trivial. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-740) Incorrect line number is generated when a string with double quotes is used instead of single quotes and is passed to UDF
[ https://issues.apache.org/jira/browse/PIG-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-740: --- Affects Version/s: (was: 0.1.0) 0.3.0 Fix Version/s: (was: 0.1.0) 0.4.0 Incorrect line number is generated when a string with double quotes is used instead of single quotes and is passed to UDF -- Key: PIG-740 URL: https://issues.apache.org/jira/browse/PIG-740 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Priority: Minor Fix For: 0.4.0 Consider the Pig script with the error that a String with double quotes {code}www\\.{code} is used instead of a single quote {code}'www\\.'{code} in the UDF string.REPLACEALL() {code} register string-2.0.jar; A = load 'inputdata' using PigStorage() as ( curr_searchQuery ); B = foreach A { domain = string.REPLACEALL(curr_searchQuery,^www\\.,''); generate domain; }; dump B; {code} I get the following error message where Line 11 points to the end of file. The error message should point to Line 5. === 2009-03-31 01:33:38,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000 2009-03-31 01:33:39,168 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001 2009-03-31 01:33:39,589 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 11, column 0. Encountered: EOF after : Details at logfile: /home/viraj/pig-svn/trunk/pig_1238463218046.log === The log file contains the following contents === ERROR 1000: Error during parsing. Lexical error at line 11, column 0. Encountered: EOF after : org.apache.pig.tools.pigscript.parser.TokenMgrError: Lexical error at line 11, column 0. 
Encountered: EOF after : at org.apache.pig.tools.pigscript.parser.PigScriptParserTokenManager.getNextToken(PigScriptParserTokenManager.java:2739) at org.apache.pig.tools.pigscript.parser.PigScriptParser.jj_ntk(PigScriptParser.java:778) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:89) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) at org.apache.pig.Main.main(Main.java:352) === -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
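The fix the reporter wants amounts to having the lexer remember where the offending token started, instead of reporting the position where scanning finally gave up (end of file). A minimal illustration of the idea, with a hypothetical `QuoteChecker` class that is not Pig's actual token manager:

```java
// Hypothetical sketch: record the line where a quoted string *opens*, so an
// unterminated quote is reported at its opening line rather than at EOF.
// Escaped quotes and comments are deliberately ignored to keep this minimal.
public class QuoteChecker {
    /**
     * Returns -1 if all single quotes are balanced, otherwise the 1-based
     * line number where the unterminated quote was opened.
     */
    public static int unterminatedQuoteLine(String script) {
        int line = 1;       // current 1-based line number
        int openLine = -1;  // line where the currently open quote started
        boolean inQuote = false;
        for (char c : script.toCharArray()) {
            if (c == '\n') line++;
            if (c == '\'') {
                inQuote = !inQuote;
                openLine = inQuote ? line : -1;
            }
        }
        return inQuote ? openLine : -1;
    }

    public static void main(String[] args) {
        String script = "A = load 'data';\n"
                      + "B = filter A by x matches 'oops;\n"  // quote never closed
                      + "dump B;\n";
        System.out.println(unterminatedQuoteLine(script));
    }
}
```

The point of the sketch: the position worth reporting is captured at the moment the construct opens, which is exactly the information the error message in this bug lacks.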
[jira] Updated: (PIG-742) Spaces could be optional in Pig syntax
[ https://issues.apache.org/jira/browse/PIG-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-742: --- Affects Version/s: (was: 0.1.0) 0.3.0 Fix Version/s: (was: 0.1.0) Spaces could be optional in Pig syntax -- Key: PIG-742 URL: https://issues.apache.org/jira/browse/PIG-742 Project: Pig Issue Type: Wish Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Priority: Minor The following Pig statements generate an error if there is no space between A and = {code} A=load 'quf.txt' using PigStorage() as (q, u, f:long); B = group A by (q); C = foreach B { F = order A by f desc; generate F; }; describe C; dump C; {code} 2009-03-31 17:14:15,959 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered PATH A=load at line 1, column 1. Was expecting one of: EOF cat ... cd ... cp ... copyFromLocal ... copyToLocal ... dump ... describe ... aliases ... explain ... help ... kill ... ls ... mv ... mkdir ... pwd ... quit ... register ... rm ... rmf ... set ... illustrate ... run ... exec ... scriptDone ... ... EOL ... ; ... It would be nice if the parser would not expect these space requirements between an alias and = -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-817) Pig Docs for 0.3.0 Release
[ https://issues.apache.org/jira/browse/PIG-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723765#action_12723765 ] Olga Natkovich commented on PIG-817: Pig-817-5.patch committed Pig Docs for 0.3.0 Release -- Key: PIG-817 URL: https://issues.apache.org/jira/browse/PIG-817 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.3.0 Reporter: Corinne Chandel Attachments: PIG-817-2.patch, PIG-817-4.patch, Pig-817-5.patch Update Pig docs for 0.3.0 release Getting Started Pig Latin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-817) Pig Docs for 0.3.0 Release
[ https://issues.apache.org/jira/browse/PIG-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723771#action_12723771 ] Corinne Chandel commented on PIG-817: - Thanks Olga!
[jira] Updated: (PIG-862) Pig Site - 0.3.0 updates
[ https://issues.apache.org/jira/browse/PIG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-862: --- Resolution: Fixed Status: Resolved (was: Patch Available) patch committed; thanks, Corinne. Pig Site - 0.3.0 updates Key: PIG-862 URL: https://issues.apache.org/jira/browse/PIG-862 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.3.0 Reporter: Corinne Chandel Attachments: PIG-862.patch Updates for Pig Site change home tab to project tab added search bar cleaned up logo image -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-820: - Status: Patch Available (was: Open) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader - Key: PIG-820 URL: https://issues.apache.org/jira/browse/PIG-820 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0, 0.4.0 Reporter: Alan Gates Assignee: Ashutosh Chauhan Fix For: 0.4.0 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch

Currently a sampling job requires that data already be stored in BinaryStorage format, since RandomSampleLoader extends BinaryStorage. For order by this has mostly been acceptable, because users tend to use order by at the end of their script where other MR jobs have already operated on the data and thus it is already being stored in BinaryStorage. For pig scripts that just did an order by, an entire MR job is required to read the data and write it out in BinaryStorage format. As we begin work on join algorithms that will require sampling, this requirement to read the entire input and write it back out will not be acceptable. Join is often the first operation of a script, and thus is much more likely to trigger this useless up front translation job. Instead RandomSampleLoader can be changed to subsume an existing loader, using the user specified loader to read the tuples while handling the skipping between tuples itself. This will require the subsumed loader to implement a SamplableLoader interface, which will look something like:

{code}
public interface SamplableLoader extends LoadFunc {
    /**
     * Skip ahead in the input stream.
     * @param n number of bytes to skip
     * @return number of bytes actually skipped. The return semantics are
     * exactly the same as {@link java.io.InputStream#skip(long)}
     */
    public long skip(long n) throws IOException;

    /**
     * Get the current position in the stream.
     * @return position in the stream.
     */
    public long getPosition() throws IOException;
}
{code}

The MRCompiler would then check if the loader being used to load data implemented the SamplableLoader interface. If so, rather than create an initial MR job to do the translation it would create the sampling job, having RandomSampleLoader use the user specified loader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
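An implementation of the proposed contract can be sketched by wrapping any `InputStream` and keeping a byte counter. The class name `PositionTrackingStream` below is illustrative, not part of Pig; the `skip`/`getPosition` semantics follow the interface quoted in the issue:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch of the SamplableLoader contract: skip() delegates to the
// underlying stream and getPosition() reports total bytes consumed so far.
public class PositionTrackingStream {
    private final InputStream in;
    private long position = 0;

    public PositionTrackingStream(InputStream in) {
        this.in = in;
    }

    /** Skip ahead; same semantics as {@link java.io.InputStream#skip(long)}. */
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);  // may skip fewer than n bytes
        position += skipped;
        return skipped;
    }

    /** Current position in the stream, in bytes from the start. */
    public long getPosition() {
        return position;
    }

    public static void main(String[] args) throws IOException {
        PositionTrackingStream s =
                new PositionTrackingStream(new ByteArrayInputStream(new byte[100]));
        System.out.println(s.skip(40));       // skips 40 of 100 bytes
        System.out.println(s.getPosition());
    }
}
```

This is the piece a sampling loader needs: it can jump between tuples with `skip` and use `getPosition` to decide how far through the input it is, while the wrapped (user-specified) loader still does the actual tuple parsing.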
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-820: - Attachment: pig-820_v5.patch
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723812#action_12723812 ] Alan Gates commented on PIG-794: PIG-734 has been committed. This will allow this patch to simplify its handling of maps to match avro maps, since Pig maps now only allow strings as keys. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.