Re: requirements for Pig 1.0?

2009-06-24 Thread Russell Jurney

For 1.0 - complete Owl?

http://wiki.apache.org/pig/Metadata

Russell Jurney
rjur...@cloudstenography.com


On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:

I don't believe there's a solid list of want to haves for 1.0.  The  
big issue I see is that there are too many interfaces that are still  
shifting, such as:


1) Data input/output formats.  The way we do slicing (that is, user  
provided InputFormats) and the equivalent outputs aren't yet solid.   
They are still too tied to load and store functions.  We need to  
break those out and understand how they will be expressed in the  
language. Related to this is the semantics of how Pig interacts with  
non-file based inputs and outputs.  We have a suggestion of moving  
to URLs, but we haven't finished test driving this to see if it will  
really be what we want.


2) The memory model.  While technically the choices we make on how  
to represent things in memory are internal, the reality is that  
these changes may affect the way we read and write tuples and bags,  
which in turn may affect our load, store, eval, and filter functions.


3) SQL.  We're working on introducing SQL soon, and it will take a few
releases to be fully baked.


4) Much better error messages.  In 0.2 our error messages made a  
leap forward, but before we can claim to be 1.0 I think they need to  
make 2 more leaps:  1) they need to be written in a way end users  
can understand them instead of in a way engineers can understand  
them, including having sufficient error documentation with suggested  
courses of action, etc.; 2) they need to be much better at tying  
errors back to where they happened in the script, right now if one  
of the MR jobs associated with a Pig Latin script fails there is no  
way to know what part of the script it is associated with.


There are probably others, but those are the ones I can think of off  
the top of my head.  The summary from my viewpoint is we still have  
several 0.x releases before we're ready to consider 1.0.  It would  
be nice to be 1.0 not too long after Hadoop is, which still gives us  
at least 6-9 months.


Alan.


On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:

I know there was some discussion of making the types release (0.2)  
a Pig 1
release, but that got nixed. There wasn't a similar discussion on  
0.3.

Has the list of want-to-haves for Pig 1.0 been discussed since?






[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723508#action_12723508
 ] 

Hadoop QA commented on PIG-851:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411309/patch_815.txt
  against trunk revision 787908.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 7 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/console

This message is automatically generated.

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: patch_815.txt


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 does not figure out the data type. As a result, the type is set to unknown, 
 resulting in a runtime failure. An example script and UDF follow.
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
 @Override
 public Map<Object, Object> exec(Tuple input) throws IOException {
 return new HashMap<Object, Object>();
 }
 //Note that the outputSchema method is commented out
 /*
 @Override
 public Schema outputSchema(Schema input) {
 try {
 return new Schema(new Schema.FieldSchema(null, null, 
 DataType.MAP));
 } catch (FrontendException e) {
 return null;
 }
 }
 */
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}
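
 For reference, a minimal workaround sketch: the same UDF with the outputSchema 
 override left in place (essentially the commented-out block above, uncommented), 
 so the front end sees a map type instead of Unknown. This is an illustration, 
 not the committed fix.
 {code}
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.DataType;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.impl.logicalLayer.FrontendException;
 import org.apache.pig.impl.logicalLayer.schema.Schema;

 // Workaround sketch: keep the outputSchema override so Pig knows the
 // return type is a map rather than falling back to Unknown.
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
     @Override
     public Map<Object, Object> exec(Tuple input) throws IOException {
         return new HashMap<Object, Object>();
     }

     @Override
     public Schema outputSchema(Schema input) {
         try {
             return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
         } catch (FrontendException e) {
             return null;
         }
     }
 }
 {code}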

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Build failed in Hudson: Pig-Patch-minerva.apache.org #98

2009-06-24 Thread Apache Hudson Server
See 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/98/changes

Changes:

[daijy] PIG-832: Make import list configurable

--
[...truncated 94726 lines...]
 [exec] [junit] 09/06/24 09:14:53 INFO 
mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
 [exec] [junit] 09/06/24 09:14:53 INFO 
mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
 [exec] [junit] 09/06/24 09:14:53 INFO dfs.DataNode: Deleting block 
blk_-8813130154407314932_1005 file 
dfs/data/data2/current/blk_-8813130154407314932
 [exec] [junit] 09/06/24 09:14:53 INFO dfs.DataNode: Deleting block 
blk_-6716071823453029809_1004 file 
dfs/data/data1/current/blk_-6716071823453029809
 [exec] [junit] 09/06/24 09:14:53 INFO dfs.DataNode: Deleting block 
blk_602371418698131255_1006 file dfs/data/data1/current/blk_602371418698131255
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Deleting block 
blk_-8813130154407314932_1005 file 
dfs/data/data8/current/blk_-8813130154407314932
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Deleting block 
blk_602371418698131255_1006 file dfs/data/data7/current/blk_602371418698131255
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* ask 
127.0.0.1:53034 to delete  blk_-6716071823453029809_1004 
blk_602371418698131255_1006 blk_-8813130154407314932_1005
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* ask 
127.0.0.1:39667 to delete  blk_-6716071823453029809_1004
 [exec] [junit] 09/06/24 09:14:54 INFO 
mapReduceLayer.JobControlCompiler: Setting up single store job
 [exec] [junit] 09/06/24 09:14:54 WARN mapred.JobClient: Use 
GenericOptionsParser for parsing the arguments. Applications should implement 
Tool for the same.
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.jar. 
blk_1281412583459416781_1012
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block 
blk_1281412583459416781_1012 src: /127.0.0.1:60705 dest: /127.0.0.1:40234
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block 
blk_1281412583459416781_1012 src: /127.0.0.1:42301 dest: /127.0.0.1:53034
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block 
blk_1281412583459416781_1012 src: /127.0.0.1:48341 dest: /127.0.0.1:39667
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block 
blk_1281412583459416781_1012 of size 1428031 from /127.0.0.1
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:39667 is added to 
blk_1281412583459416781_1012 size 1428031
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: PacketResponder 0 
for block blk_1281412583459416781_1012 terminating
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block 
blk_1281412583459416781_1012 of size 1428031 from /127.0.0.1
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: PacketResponder 1 
for block blk_1281412583459416781_1012 terminating
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:53034 is added to 
blk_1281412583459416781_1012 size 1428031
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block 
blk_1281412583459416781_1012 of size 1428031 from /127.0.0.1
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: PacketResponder 2 
for block blk_1281412583459416781_1012 terminating
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:40234 is added to 
blk_1281412583459416781_1012 size 1428031
 [exec] [junit] 09/06/24 09:14:54 INFO fs.FSNamesystem: Increasing 
replication for file 
/tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.jar. New replication 
is 2
 [exec] [junit] 09/06/24 09:14:54 INFO fs.FSNamesystem: Reducing 
replication for file 
/tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.jar. New replication 
is 2
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_200906240914_0002/job.split. 
blk_-1411835332935289445_1013
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block 
blk_-1411835332935289445_1013 src: /127.0.0.1:33410 dest: /127.0.0.1:41639
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block 
blk_-1411835332935289445_1013 src: /127.0.0.1:60709 dest: /127.0.0.1:40234
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Receiving block 
blk_-1411835332935289445_1013 src: /127.0.0.1:48344 dest: /127.0.0.1:39667
 [exec] [junit] 09/06/24 09:14:54 INFO dfs.DataNode: Received block 

Build failed in Hudson: Pig-Patch-minerva.apache.org #99

2009-06-24 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/99/

--
started
Building remotely on minerva.apache.org (Ubuntu)
Updating http://svn.apache.org/repos/asf/hadoop/pig/trunk
Fetching 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' at 
-1 into 
'http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/trunk/test/bin'
 
At revision 787980
At revision 787980
no change for http://svn.apache.org/repos/asf/hadoop/pig/trunk since the 
previous build
no change for http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch 
since the previous build
[Pig-Patch-minerva.apache.org] $ /bin/bash /tmp/hudson8576193962376397285.sh
/home/hudson/tools/java/latest1.6/bin/java
Buildfile: build.xml

check-for-findbugs:

findbugs.check:

java5.check:

forrest.check:

hudson-test-patch:
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Testing patch for PIG-862.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Reverted 'src/org/apache/pig/data/DataType.java'
 [exec] 
 [exec] Fetching external item into 'test/bin'
 [exec] A    test/bin/test-patch.sh
 [exec] Updated external to revision 787980.
 [exec] 
 [exec] Updated to revision 787980.
 [exec] PIG-862 patch is being downloaded at Wed Jun 24 11:24:30 UTC 2009 
from
 [exec] 
http://issues.apache.org/jira/secure/attachment/12411560/PIG-862.patch
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Pre-building trunk to determine trunk number
 [exec] of release audit, javac, and Findbugs warnings.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] /home/hudson/tools/ant/latest/bin/ant  
-Djava5.home=/home/hudson/tools/java/latest1.5 
-Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= releaseaudit 
 
> http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/patchprocess/trunkReleaseAuditWarnings.txt 2>&1
 [exec] /home/hudson/tools/ant/latest/bin/ant  -Djavac.args=-Xlint 
-Xmaxwarns 1000 -Declipse.home=/home/nigel/tools/eclipse/latest 
-Djava5.home=/home/hudson/tools/java/latest1.5 
-Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= clean tar  
> http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/ws/patchprocess/trunkJavacWarnings.txt 2>&1
 [exec] /home/hudson/tools/ant/latest/bin/ant  
-Dfindbugs.home=/home/nigel/tools/findbugs/latest 
-Djava5.home=/home/hudson/tools/java/latest1.5 
-Dforrest.home=/home/nigel/tools/forrest/latest -DPigPatchProcess= findbugs  
> /dev/null 2>&1
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Checking there are no @author tags in the patch.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] There appear to be 0 @author tags in the patch.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Checking there are new or changed tests in the patch.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] There appear to be 0 test files referenced in the patch.
 [exec] The patch appears to be a documentation patch that doesn't require 
tests.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Applying patch.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] (Stripping trailing CRs from patch.)
 [exec] can't find file to patch at input line 5
 [exec] Perhaps you used the wrong -p or --strip option?
 [exec] The text leading up to this was:
 [exec] --
 [exec] 

[jira] Commented: (PIG-862) Pig Site - 0.3.0 updates

2009-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723528#action_12723528
 ] 

Hadoop QA commented on PIG-862:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411560/PIG-862.patch
  against trunk revision 787908.

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/99/console

This message is automatically generated.

 Pig Site - 0.3.0 updates
 

 Key: PIG-862
 URL: https://issues.apache.org/jira/browse/PIG-862
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel
 Attachments: PIG-862.patch


 Updates for Pig Site
  change home tab to project tab
  added search bar
  cleaned up logo image

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-832) Make import list configurable

2009-06-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723554#action_12723554
 ] 

Hudson commented on PIG-832:


Integrated in Pig-trunk #484 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/484/])
PIG-832: Make import list configurable


 Make import list configurable
 -

 Key: PIG-832
 URL: https://issues.apache.org/jira/browse/PIG-832
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-832-1.patch, PIG-832-2.patch


 Currently, it is hardwired in PigContext.
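
 For context, a rough sketch of what a configurable import list amounts to: a short 
 UDF name is tried against each package prefix in turn. The class and method names 
 below are invented for illustration and are not Pig's actual PigContext API.
 {code}
 import java.util.Arrays;
 import java.util.List;

 // Illustrative only: resolve a short function name against a configurable
 // list of package prefixes instead of a hardwired one.
 public class UdfResolver {
     private final List<String> packageImportList;

     public UdfResolver(List<String> packageImportList) {
         this.packageImportList = packageImportList;
     }

     public Class<?> resolve(String shortName) throws ClassNotFoundException {
         for (String prefix : packageImportList) {
             try {
                 return Class.forName(prefix + shortName);
             } catch (ClassNotFoundException e) {
                 // not under this prefix; try the next one
             }
         }
         throw new ClassNotFoundException("Could not resolve " + shortName);
     }

     public static void main(String[] args) throws Exception {
         UdfResolver r = new UdfResolver(Arrays.asList(
                 "", "java.lang.", "org.apache.pig.builtin."));
         System.out.println(r.resolve("String"));  // resolves to java.lang.String
     }
 }
 {code}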

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-734) Non-string keys in maps

2009-06-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723555#action_12723555
 ] 

Hudson commented on PIG-734:


Integrated in Pig-trunk #484 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/484/])
PIG-734:  Changed maps to only take strings as keys.


 Non-string keys in maps
 ---

 Key: PIG-734
 URL: https://issues.apache.org/jira/browse/PIG-734
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch


 With the addition of types to pig, maps were changed to allow any atomic type 
 to be a key.  However, in practice we do not see people using keys other than 
 strings.  And allowing multiple types is causing us issues in serializing 
 data (we have to check what every key type is) and in the design for non-java 
 UDFs (since many scripting languages include associative arrays such as 
 Perl's hash).
 So I propose we scope back maps to only have string keys.  This would be a 
 non-compatible change.  But I am not aware of anyone using non-string keys, 
 so hopefully it would have little or no impact.
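
 As a rough illustration of the proposed convention (not code from this patch), a UDF 
 whose map output uses only string keys would look something like the sketch below; the 
 class name is made up for the example.
 {code}
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.Tuple;

 // Illustrative only: restrict map keys to strings, matching the proposal
 // to scope maps back to string keys.
 public class StringKeyedMapUDF extends EvalFunc<Map<String, Object>> {
     @Override
     public Map<String, Object> exec(Tuple input) throws IOException {
         Map<String, Object> m = new HashMap<String, Object>();
         // key is always a string; the value can be any Pig-supported type
         m.put("size", input == null ? 0 : input.size());
         return m;
     }
 }
 {code}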

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-851:
---

Status: Open  (was: Patch Available)

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: patch_815.txt


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 does not figure out the data type. As a result, the type is set to unknown, 
 resulting in a runtime failure. An example script and UDF follow.
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
 @Override
 public Map<Object, Object> exec(Tuple input) throws IOException {
 return new HashMap<Object, Object>();
 }
 //Note that the outputSchema method is commented out
 /*
 @Override
 public Schema outputSchema(Schema input) {
 try {
 return new Schema(new Schema.FieldSchema(null, null, 
 DataType.MAP));
 } catch (FrontendException e) {
 return null;
 }
 }
 */
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-851:
---

Attachment: (was: patch_815.txt)

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
 Fix For: 0.4.0


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 does not figure out the data type. As a result, the type is set to unknown, 
 resulting in a runtime failure. An example script and UDF follow.
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
 @Override
 public Map<Object, Object> exec(Tuple input) throws IOException {
 return new HashMap<Object, Object>();
 }
 //Note that the outputSchema method is commented out
 /*
 @Override
 public Schema outputSchema(Schema input) {
 try {
 return new Schema(new Schema.FieldSchema(null, null, 
 DataType.MAP));
 } catch (FrontendException e) {
 return null;
 }
 }
 */
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-851:
---

Status: Patch Available  (was: Open)

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: Pig_851_patch.txt


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 does not figure out the data type. As a result, the type is set to unknown, 
 resulting in a runtime failure. An example script and UDF follow.
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
 @Override
 public Map<Object, Object> exec(Tuple input) throws IOException {
 return new HashMap<Object, Object>();
 }
 //Note that the outputSchema method is commented out
 /*
 @Override
 public Schema outputSchema(Schema input) {
 try {
 return new Schema(new Schema.FieldSchema(null, null, 
 DataType.MAP));
 } catch (FrontendException e) {
 return null;
 }
 }
 */
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: requirements for Pig 1.0?

2009-06-24 Thread Alan Gates
Integration with Owl is something we want for 1.0.  I am hopeful that  
by Pig's 1.0 Owl will have flown the coop and become either a  
subproject or found a home in Hadoop's common, since it will hopefully  
be used by multiple other subprojects.


Alan.

On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote:


For 1.0 - complete Owl?

http://wiki.apache.org/pig/Metadata

Russell Jurney
rjur...@cloudstenography.com


On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:

I don't believe there's a solid list of want to haves for 1.0.  The  
big issue I see is that there are too many interfaces that are  
still shifting, such as:


1) Data input/output formats.  The way we do slicing (that is, user  
provided InputFormats) and the equivalent outputs aren't yet  
solid.  They are still too tied to load and store functions.  We  
need to break those out and understand how they will be expressed  
in the language. Related to this is the semantics of how Pig  
interacts with non-file based inputs and outputs.  We have a  
suggestion of moving to URLs, but we haven't finished test driving  
this to see if it will really be what we want.


2) The memory model.  While technically the choices we make on how  
to represent things in memory are internal, the reality is that  
these changes may affect the way we read and write tuples and bags,  
which in turn may affect our load, store, eval, and filter functions.


3) SQL.  We're working on introducing SQL soon, and it will take a few
releases to be fully baked.


4) Much better error messages.  In 0.2 our error messages made a  
leap forward, but before we can claim to be 1.0 I think they need  
to make 2 more leaps:  1) they need to be written in a way end  
users can understand them instead of in a way engineers can  
understand them, including having sufficient error documentation  
with suggested courses of action, etc.; 2) they need to be much  
better at tying errors back to where they happened in the script,  
right now if one of the MR jobs associated with a Pig Latin script  
fails there is no way to know what part of the script it is  
associated with.


There are probably others, but those are the ones I can think of  
off the top of my head.  The summary from my viewpoint is we still  
have several 0.x releases before we're ready to consider 1.0.  It  
would be nice to be 1.0 not too long after Hadoop is, which still  
gives us at least 6-9 months.


Alan.


On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:

I know there was some discussion of making the types release (0.2)  
a Pig 1
release, but that got nixed. There wasn't a similar discussion on  
0.3.

Has the list of want-to-haves for Pig 1.0 been discussed since?








Re: requirements for Pig 1.0?

2009-06-24 Thread Dmitriy Ryaboy
Alan, any thoughts on performance baselines and benchmarks?

I am a little surprised that you think SQL is a requirement for 1.0, since
it's essentially an overlay, not core functionality.

What about the storage layer rewrite (or is that what you referred to with
your first bullet-point)?

Also, the subject of making more (or all) operators nestable within a
foreach comes up now and then.. would you consider this important for 1.0,
or something that can wait?

Integration with other languages (a-la PyPig)?

The Roadmap on the Wiki is still "as of Q3 2007", which makes it hard for an
outside contributor to know where to jump :-).

-D


On Wed, Jun 24, 2009 at 10:02 AM, Alan Gates ga...@yahoo-inc.com wrote:

 Integration with Owl is something we want for 1.0.  I am hopeful that by
 Pig's 1.0 Owl will have flown the coop and become either a subproject or
 found a home in Hadoop's common, since it will hopefully be used by multiple
 other subprojects.

 Alan.


 On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote:

  For 1.0 - complete Owl?

 http://wiki.apache.org/pig/Metadata

 Russell Jurney
 rjur...@cloudstenography.com


 On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:

  I don't believe there's a solid list of want to haves for 1.0.  The big
 issue I see is that there are too many interfaces that are still shifting,
 such as:

 1) Data input/output formats.  The way we do slicing (that is, user
 provided InputFormats) and the equivalent outputs aren't yet solid.  They
 are still too tied to load and store functions.  We need to break those out
 and understand how they will be expressed in the language. Related to this
 is the semantics of how Pig interacts with non-file based inputs and
 outputs.  We have a suggestion of moving to URLs, but we haven't finished
 test driving this to see if it will really be what we want.

 2) The memory model.  While technically the choices we make on how to
 represent things in memory are internal, the reality is that these changes
 may affect the way we read and write tuples and bags, which in turn may
 affect our load, store, eval, and filter functions.

 3) SQL.  We're working on introducing SQL soon, and it will take a few
 releases to be fully baked.

 4) Much better error messages.  In 0.2 our error messages made a leap
 forward, but before we can claim to be 1.0 I think they need to make 2 more
 leaps:  1) they need to be written in a way end users can understand them
 instead of in a way engineers can understand them, including having
 sufficient error documentation with suggested courses of action, etc.; 2)
 they need to be much better at tying errors back to where they happened in
 the script, right now if one of the MR jobs associated with a Pig Latin
 script fails there is no way to know what part of the script it is
 associated with.

 There are probably others, but those are the ones I can think of off the
 top of my head.  The summary from my viewpoint is we still have several 0.x
 releases before we're ready to consider 1.0.  It would be nice to be 1.0 not
 too long after Hadoop is, which still gives us at least 6-9 months.

 Alan.


 On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:

  I know there was some discussion of making the types release (0.2) a
 Pig 1
 release, but that got nixed. There wasn't a similar discussion on 0.3.
 Has the list of want-to-haves for Pig 1.0 been discussed since?







Re: requirements for Pig 1.0?

2009-06-24 Thread Alan Gates
To be clear, going to 1.0 is not about having a certain set of  
features.  It is about stability and usability.  When a project  
declares itself 1.0 it is making some guarantees regarding the  
stability of its interfaces (in Pig's case this is Pig Latin, UDFs,  
and command line usage).  It is also declaring itself ready for the  
world at large, not just the brave and the free.  New features can  
come in as experimental once we're 1.0, but the semantics of the  
language and UDFs can't be shifting (as we've done the last several  
releases and will continue to do for a bit I think).


With that in mind, further comments inlined.

On Jun 24, 2009, at 10:18 AM, Dmitriy Ryaboy wrote:


Alan, any thoughts on performance baselines and benchmarks?
Meaning do we need to reach a certain speed before 1.0?  I don't think  
so.  Pig is fast enough now that many people find it useful.  We want  
to continue working to shrink the gap between Pig and MR, but I don't  
see this as a blocker for 1.0.




I am a little surprised that you think SQL is a requirement for 1.0, since
it's essentially an overlay, not core functionality.
If we were debating today whether to go 1.0, I agree that we would not  
wait for SQL.  But given that we aren't (at least I wouldn't vote for  
it now) and that SQL will be in soon, it will need to stabilize.


What about the storage layer rewrite (or is that what you referred to with
your first bullet-point)?
To be clear, the Zebra (columnar store stuff) is not a rewrite of the  
storage layer.  It is an additional storage option we want to  
support.  We aren't changing current support for load and store.




Also, the subject of making more (or all) operators nestable within a
foreach comes up now and then.. would you consider this important for 1.0,
or something that can wait?

This would be an added feature, not a semantic change in Pig Latin.



Integration with other languages (a-la PyPig)?

Again, this is a new feature, not a stability issue.



The Roadmap on the Wiki is still "as of Q3 2007", which makes it hard for an
outside contributor to know where to jump :-).
Agreed.  Olga has given me the task of updating this soon.  I'm going  
to try to get to that over the next couple of weeks.  This discussion  
will certainly provide input to that update.


Alan.




[jira] Created: (PIG-863) Function (UDF) automatic namespace resolution is really needed

2009-06-24 Thread David Ciemiewicz (JIRA)
Function (UDF) automatic namespace resolution is really needed
--

 Key: PIG-863
 URL: https://issues.apache.org/jira/browse/PIG-863
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz


The Apache PiggyBank documentation says that to reference a function, I need to 
specify a function as:

org.apache.pig.piggybank.evaluation.string.UPPER(text)

As in the example:

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
TweetsInaug  = FILTER Tweets BY 
org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES 
'.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
{code}

Why can't we implement automatic namespace resolution so we can just 
reference UPPER without namespace qualifiers?

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
TweetsInaug  = FILTER Tweets BY UPPER(text) MATCHES 
'.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
{code}

I know about the workaround:

{code}
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
{code}

But this is really a pain to do if I have lots of functions.

Just warn if there is a collision and suggest I use the define workaround in 
the warning messages.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-863) Function (UDF) automatic namespace resolution is really needed

2009-06-24 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723683#action_12723683
 ] 

Dmitriy V. Ryaboy commented on PIG-863:
---

I believe PIG-832 addresses this.

 Function (UDF) automatic namespace resolution is really needed
 --

 Key: PIG-863
 URL: https://issues.apache.org/jira/browse/PIG-863
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz

 The Apache PiggyBank documentation says that to reference a function, I need 
 to specify a function as:
 org.apache.pig.piggybank.evaluation.string.UPPER(text)
 As in the example:
 {code}
 REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
 TweetsInaug  = FILTER Tweets BY 
 org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES 
 '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
 {code}
 Why can't we implement automatic namespace resolution so we can just 
 reference UPPER without namespace qualifiers?
 {code}
 REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
 TweetsInaug  = FILTER Tweets BY UPPER(text) MATCHES 
 '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
 {code}
 I know about the workaround:
 {code}
 define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
 {code}
 But this is really a pain to do if I have lots of functions.
 Just warn if there is a collision and suggest I use the define workaround 
 in the warning messages.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-24 Thread Olga Natkovich
Hi,
 
Thanks for everybody who voted!

We have four +1 binding votes from PMC members Arun Murthy, Nigel Daley,
Alan Gates, and Olga Natkovich. We have three +1 non-binding votes from
Pig Committers Pradeep Kamath, Daniel Dai, and Santhosh Srinivasan.

There are no -1 votes. Also sufficient time has passed to make the
release official.
 
Unless I hear otherwise, I am going to start publishing the release
shortly.
 
Olga 

 -Original Message-
 From: Olga Natkovich [mailto:ol...@yahoo-inc.com] 
 Sent: Thursday, June 18, 2009 12:30 PM
 To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org; 
 gene...@hadoop.apache.org
 Subject: [VOTE] Release Pig 0.3.0 (candidate 0)
 
 Hi,
  
 I created a candidate build for Pig 0.3.0 release. The main 
 feature of this release is support for multiquery, which 
 allows sharing computation across multiple queries within 
 the same script. We see significant performance improvements 
 (up to an order of magnitude) as the result of this optimization.
  
 I ran the rat report and made sure that all the source files 
 contain proper headers. (Not attaching the report since it 
 caused trouble with the last release.)
  
 Keys used to sign the release candidate are at 
 http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS.
  
 Please, download and try the release candidate:
 http://people.apache.org/~olga/pig-0.3.0-candidate-0/.
  
 Please, vote by Wednesday, June 24th.
  
 Olga
  
 


[jira] Updated: (PIG-660) Integration with Hadoop 0.20

2009-06-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-660:
---

Fix Version/s: (was: 0.1.0)
   0.4.0

 Integration with Hadoop 0.20
 

 Key: PIG-660
 URL: https://issues.apache.org/jira/browse/PIG-660
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, 
 PIG-660_3.patch


 With Hadoop 0.20, it will be possible to query the status of each map and 
 reduce in a map reduce job. This will allow better error reporting. Some of 
 the other items that could be on Hadoop's feature requests/bugs are 
 documented here for tracking.
 1. Hadoop should return objects instead of strings when exceptions are thrown
 2. The JobControl should handle all exceptions and report them appropriately. 
 For example, when the JobControl fails to launch jobs, it should handle 
 exceptions appropriately and should support APIs that query this state, i.e., 
 failure to launch jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-391) There should be a DataAtom.setValue(DataAtom)

2009-06-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-391:
---

Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

This is no longer relevant in the current code

 There should be a DataAtom.setValue(DataAtom)
 -

 Key: PIG-391
 URL: https://issues.apache.org/jira/browse/PIG-391
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.1.0
Reporter: Ted Dunning
 Fix For: 0.1.0

 Attachments: setValue.patch


 setValue on a DataAtom can accept a string or integer or double, but not a 
 DataAtom.  That means I have to inject a string conversion or type test into 
 my code when I write a UDF.  Definitely not good.
 This should be trivial.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-740) Incorrect line number is generated when a string with double quotes is used instead of single quotes and is passed to UDF

2009-06-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-740:
---

Affects Version/s: (was: 0.1.0)
   0.3.0
Fix Version/s: (was: 0.1.0)
   0.4.0

 Incorrect line number is generated when a string  with double quotes is used 
 instead of single quotes and is passed to UDF
 --

 Key: PIG-740
 URL: https://issues.apache.org/jira/browse/PIG-740
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Minor
 Fix For: 0.4.0


 Consider the Pig script with the error that a String with double quotes 
 {code}"www\\."{code} is used instead of a single quote {code}'www\\.'{code} 
 in the UDF string.REPLACEALL()
 {code}
 register string-2.0.jar;
 A = load 'inputdata' using PigStorage() as ( curr_searchQuery );
 B = foreach A {
 domain = string.REPLACEALL(curr_searchQuery,"^www\\.",'');
 generate
 domain;
 };
 dump B;
 {code}
 I get the following error message where Line 11 points to the end of file. 
 The error message should point to Line 5.
 ===
 2009-03-31 01:33:38,403 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:9000
 2009-03-31 01:33:39,168 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:9001
 2009-03-31 01:33:39,589 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Lexical error at line 11, column 0.  Encountered: 
 EOF after : 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1238463218046.log
 ===
 The log file contains the following contents
 ===
 ERROR 1000: Error during parsing. Lexical error at line 11, column 0.  
 Encountered: EOF after : 
 org.apache.pig.tools.pigscript.parser.TokenMgrError: Lexical error at line 
 11, column 0.  Encountered: EOF after : 
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParserTokenManager.getNextToken(PigScriptParserTokenManager.java:2739)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.jj_ntk(PigScriptParser.java:778)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:89)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:352)
 ===

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-742) Spaces could be optional in Pig syntax

2009-06-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-742:
---

Affects Version/s: (was: 0.1.0)
   0.3.0
Fix Version/s: (was: 0.1.0)

 Spaces could be optional in Pig syntax
 --

 Key: PIG-742
 URL: https://issues.apache.org/jira/browse/PIG-742
 Project: Pig
  Issue Type: Wish
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Minor

 The following Pig statements generate an error if there is no space between A 
  and =
 {code}
 A=load 'quf.txt' using PigStorage() as (q, u, f:long);
 B = group A by (q);
 C = foreach B {
 F = order A by f desc;
 generate F;
 };
 describe C;
 dump C;
 {code}
 2009-03-31 17:14:15,959 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Encountered
  PATH A=load  at line 1, column 1.
 Was expecting one of:
 EOF 
 cat ...
 cd ...
 cp ...
 copyFromLocal ...
 copyToLocal ...
 dump ...
 describe ...
 aliases ...
 explain ...
 help ...
 kill ...
 ls ...
 mv ...
 mkdir ...
 pwd ...
 quit ...
 register ...
 rm ...
 rmf ...
 set ...
 illustrate ...
 run ...
 exec ...
 scriptDone ...
  ...
 EOL ...
 ; ...
 It would be nice if the parser would not expect these space requirements 
 between an alias and =

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-817) Pig Docs for 0.3.0 Release

2009-06-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723765#action_12723765
 ] 

Olga Natkovich commented on PIG-817:


Pig-817-5.patch committed

 Pig Docs for 0.3.0 Release
 --

 Key: PIG-817
 URL: https://issues.apache.org/jira/browse/PIG-817
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel
 Attachments: PIG-817-2.patch, PIG-817-4.patch, Pig-817-5.patch


 Update Pig docs for 0.3.0 release
  Getting Started 
  Pig Latin

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-817) Pig Docs for 0.3.0 Release

2009-06-24 Thread Corinne Chandel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723771#action_12723771
 ] 

Corinne Chandel commented on PIG-817:
-

Thanks Olga!

 Pig Docs for 0.3.0 Release
 --

 Key: PIG-817
 URL: https://issues.apache.org/jira/browse/PIG-817
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel
 Attachments: PIG-817-2.patch, PIG-817-4.patch, Pig-817-5.patch


 Update Pig docs for 0.3.0 release
  Getting Started 
  Pig Latin

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-862) Pig Site - 0.3.0 updates

2009-06-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-862:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed; thanks, Corinne.

 Pig Site - 0.3.0 updates
 

 Key: PIG-862
 URL: https://issues.apache.org/jira/browse/PIG-862
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel
 Attachments: PIG-862.patch


 Updates for Pig Site
  change home tab to project tab
  added search bar
  cleaned up logo image

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-24 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch


 Currently a sampling job requires that data already be stored in 
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
 order by this
 has mostly been acceptable, because users tend to use order by at the end of 
 their script where other MR jobs have already operated on the data and thus it
 is already being stored in BinaryStorage.  For pig scripts that just did an 
 order by, an entire MR job is required to read the data and write it out
 in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.
 Join is often the first operation of a script, and thus is much more likely 
 to trigger this useless up front translation job.
 Instead RandomSampleLoader can be changed to subsume an existing loader, 
 using the user specified loader to read the tuples while handling the skipping
 between tuples itself.  This will require the subsumed loader to implement a 
 Samplable Interface, that will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {
 
 /**
  * Skip ahead in the input stream.
  * @param n number of bytes to skip
  * @return number of bytes actually skipped.  The return semantics are
  * exactly the same as {@link java.io.InputStream#skip(long)}
  */
 public long skip(long n) throws IOException;
 
 /**
  * Get the current position in the stream.
  * @return position in the stream.
  */
 public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check if the loader being used to load data 
 implemented the SamplableLoader interface.  If so, rather than create an 
 initial MR
 job to do the translation it would create the sampling job, having 
 RandomSampleLoader use the user specified loader.
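
 To make the proposed interface concrete, below is a minimal sketch of how a loader 
 might satisfy SamplableLoader by tracking how far it has read from its input stream. 
 The class and field names are invented for illustration; an actual loader would also 
 implement the rest of LoadFunc.
 {code}
 import java.io.IOException;
 import java.io.InputStream;

 // Illustrative only: skip() and getPosition() delegate to the underlying
 // stream while keeping a running count of bytes consumed.
 public class CountingSamplableLoader {
     private InputStream in;   // stream bound to the loader's input split
     private long position;    // bytes consumed so far

     public void bindTo(InputStream in) {
         this.in = in;
         this.position = 0;
     }

     public long skip(long n) throws IOException {
         long skipped = in.skip(n);  // may skip fewer bytes near end of stream
         position += skipped;
         return skipped;
     }

     public long getPosition() throws IOException {
         return position;
     }
 }
 {code}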

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-24 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v5.patch

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch


 Currently a sampling job requires that data already be stored in 
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
 order by this
 has mostly been acceptable, because users tend to use order by at the end of 
 their script where other MR jobs have already operated on the data and thus it
 is already being stored in BinaryStorage.  For pig scripts that just did an 
 order by, an entire MR job is required to read the data and write it out
 in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.
 Join is often the first operation of a script, and thus is much more likely 
 to trigger this useless up front translation job.
 Instead RandomSampleLoader can be changed to subsume an existing loader, 
 using the user specified loader to read the tuples while handling the skipping
 between tuples itself.  This will require the subsumed loader to implement a 
 Samplable Interface, that will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {
 
 /**
  * Skip ahead in the input stream.
  * @param n number of bytes to skip
  * @return number of bytes actually skipped.  The return semantics are
  * exactly the same as {@link java.io.InputStream#skip(long)}
  */
 public long skip(long n) throws IOException;
 
 /**
  * Get the current position in the stream.
  * @return position in the stream.
  */
 public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check if the loader being used to load data 
 implemented the SamplableLoader interface.  If so, rather than create an 
 initial MR
 job to do the translation it would create the sampling job, having 
 RandomSampleLoader use the user specified loader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-06-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723812#action_12723812
 ] 

Alan Gates commented on PIG-794:


PIG-734 has been committed.  This will allow this patch to simplify its 
handling of maps to match avro maps, since Pig maps now only allow strings as 
keys.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.