[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792399#action_12792399
 ] 

Hadoop QA commented on PIG-1102:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428356/PIG_1102.patch
  against trunk revision 892125.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 400 release audit warnings 
(more than the trunk's current 397 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/console

This message is automatically generated.

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to the disk is useful for understanding query performance and also for 
 seeing how certain changes in Pig affect that.
 Other interesting stats to collect would be average CPU usage and max mem 
 usage but I am not sure if this information is easily retrievable.
 Using Hadoop counters for this would make sense.
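As a rough sketch of the counter idea only (the enum name and call site below are invented for illustration and are not from the attached patch), a spill site could bump a Hadoop counter through the task's Reporter:

{code}
// Illustrative sketch: count spills via Hadoop counters.
// The enum and the place it gets incremented are made up for this example.
public class SpillCounterSketch {
    public enum PigCounters { SPILLED_BAGS, SPILLED_RECORDS }

    // Invoked from wherever a bag spills to disk; 'reporter' is the
    // org.apache.hadoop.mapred.Reporter of the running map/reduce task.
    public static void recordSpill(org.apache.hadoop.mapred.Reporter reporter,
                                   long recordsSpilled) {
        reporter.incrCounter(PigCounters.SPILLED_BAGS, 1);
        reporter.incrCounter(PigCounters.SPILLED_RECORDS, recordsSpilled);
    }
}
{code}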

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792468#action_12792468
 ] 

Hadoop QA commented on PIG-1157:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428359/PIG-1157.patch
  against trunk revision 892125.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/140/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/140/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/140/console

This message is automatically generated.

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply 

[jira] Updated: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1157:
--

Status: Open  (was: Patch Available)

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1157:
--

Attachment: PIG-1157.patch

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1157:
--

Status: Patch Available  (was: Open)

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1110:
--

Attachment: PIG-1110.patch


The output from running ant test-patch target locally:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

{code}

 Handle compressed file formats -- Gz, BZip with the new proposal
 

 Key: PIG-1110
 URL: https://issues.apache.org/jira/browse/PIG-1110
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1110.patch, PIG-1110.patch, PIG_1110_Jeff.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1117) Pig reading hive columnar rc tables

2009-12-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792564#action_12792564
 ] 

Alan Gates commented on PIG-1117:
-

There seems to be a lot of code duplication between 
HiveColumnarLoader.setup(String, boolean, String) and 
HiveColumnarLoader.setup(String, boolean).  Could these two functions be 
combined or the common code factored out?

Pig doesn't support BOOLEAN and BYTE as external types; we only use them 
internally.  So these should be converted to something else in 
HiveColumnarLoader.findPigDataType.
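For illustration only (the method exists per the comment above, but the mapping below is a guess, not part of the attached patch), findPigDataType could translate the Hive types to Pig-supported ones along these lines:

{code}
// Sketch of a possible mapping from Hive column types to Pig DataType
// constants; BOOLEAN and BYTE fall back to INTEGER since Pig does not
// expose them as external types.
import org.apache.pig.data.DataType;

public class HiveToPigTypeSketch {
    public static byte findPigDataType(String hiveType) {
        if ("int".equals(hiveType))      return DataType.INTEGER;
        if ("bigint".equals(hiveType))   return DataType.LONG;
        if ("float".equals(hiveType))    return DataType.FLOAT;
        if ("double".equals(hiveType))   return DataType.DOUBLE;
        if ("string".equals(hiveType))   return DataType.CHARARRAY;
        if ("boolean".equals(hiveType) || "tinyint".equals(hiveType))
            return DataType.INTEGER;     // converted, not passed through
        return DataType.BYTEARRAY;       // safe default for anything else
    }
}
{code}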

You may want to implement fieldsToRead, as that allows Pig to tell your loader 
exactly what fields it requires for this query, without requiring the user to 
specify it.

In HiveColumnarLoader.readRowColumns it is good to use 
TupleFactory.newTuple(int) rather than TupleFactory.newTuple() when you know 
the size of the tuple you'll be creating.  newTuple(int) plus Tuple.set() is 
more efficient than newTuple() + Tuple.append().
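The difference, in a minimal standalone sketch (field values are placeholders):

{code}
// Sketch contrasting the two allocation styles.
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleAllocSketch {
    public static Tuple build(int numColumns) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();

        // Preferred when the size is known: allocate once, then set fields.
        Tuple t = tf.newTuple(numColumns);
        for (int i = 0; i < numColumns; i++) {
            t.set(i, "col" + i);
        }
        return t;

        // Less efficient alternative: start empty and grow it.
        //   Tuple t2 = tf.newTuple();
        //   t2.append("col0");
    }
}
{code}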

svn diff doesn't add jars to patch files, so you'll need to attach the 
hive-exec.jar separately to the jira so that we can run tests.

Also, please be aware that we are rewriting the entire load/store interface, 
and hope to release this soon, probably in 0.7.  See PIG-966 for details.  This 
obviously will affect your code.  Hopefully it will make it much easier, as the 
need to write a separate slicer will go away.


 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
 PIG-1117.patch


 I've coded a LoadFunc implementation that can read from Hive Columnar RC 
 tables; this is needed for a project that I'm working on because all our data 
 is stored using the Hive thrift-serialized Columnar RC format. I have looked 
 at the piggybank but did not find any implementation that could do this. 
 We've been running it on our cluster for the last week and have worked out 
 most bugs.
  
 There are still some improvements I would like to make, such as setting 
 the number of mappers based on date partitioning. It's been optimized to 
 read only specific columns and can churn through a data set almost 8 times 
 faster with this improvement because not all column data is read.
 I would like to contribute the class to the piggybank; can you guide me in 
 what I need to do?
 I've used Hive-specific classes to implement this; is it possible to add them 
 to the piggybank ivy build for automatic download of the dependencies?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

2009-12-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792569#action_12792569
 ] 

Alan Gates commented on PIG-480:


What kind of performance gain do we get from this?  The only PigMix query that 
looks like it would be directly affected is PigMix_3.  It would be interesting 
to run that and a few other queries that we expect would benefit from this to 
measure the performance improvements.

 PERFORMANCE: Use identity mapper in a chain of M-R jobs
 ---

 Key: PIG-480
 URL: https://issues.apache.org/jira/browse/PIG-480
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Ying He
 Attachments: PIG_480.patch, PIG_480.patch


 For plans with two or more MR jobs, use an identity mapper wherever possible in 
 the second and subsequent MR jobs. The identity mapper is about 50% faster than an 
 empty Pig map job because it doesn't parse the data. 
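As a sketch of the idea using the old mapred API (not the attached patch), the second job's map stage would just be Hadoop's identity mapper so that no Pig pipeline or record parsing runs on the map side:

{code}
// Minimal sketch: make a chained job use Hadoop's built-in identity mapper.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class IdentityMapSketch {
    public static void useIdentityMap(JobConf conf) {
        // Keys and values flow straight from the input to the shuffle.
        conf.setMapperClass(IdentityMapper.class);
    }
}
{code}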

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792573#action_12792573
 ] 

Alan Gates commented on PIG-1149:
-

Changes look fine.

Pradeep, will this apply as is to the load-store redesign branch or will we 
need a separate patch for that?

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-18 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792578#action_12792578
 ] 

Pradeep Kamath commented on PIG-1149:
-

I tried applying it on the branch and it failed:
{noformat}
:/tmp/load-store-redesign]patch -p0 < 
/homes/pradeepk/dev/pig-apache/pig/trunk/pig_1149.patch 
patching file src/org/apache/pig/impl/builtin/SampleLoader.java
Hunk #1 succeeded at 31 with fuzz 2 (offset -4 lines).
Hunk #2 FAILED at 46.
1 out of 2 hunks FAILED -- saving rejects to file 
src/org/apache/pig/impl/builtin/SampleLoader.java.rej
patching file test/org/apache/pig/test/TestPoissonSampleLoader.java
[prade...@chargesize:/tmp/load-store-redesign]
{noformat}

Since Thejas worked on PIG-1062, he might be in a better position to check 
whether this patch needs changes.

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1158) pig command line -M option doesn't support table union correctly (comma separated paths)

2009-12-18 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792589#action_12792589
 ] 

Richard Ding commented on PIG-1158:
---

Without the -M option, Pig converts paths to their absolute locations before passing 
them to the loaders/storers. With the -M option, Pig passes the paths as-is to 
the loaders/storers. 

This distinction seems to be obsolete. The fix will be to convert paths to 
their absolute locations in both cases.
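A minimal sketch of what that conversion could look like (the helper below is illustrative, not the actual fix): each path in a comma-separated list is qualified against the default FileSystem before being handed to the loader.

{code}
// Illustrative helper: qualify every path in a comma-separated list so
// loaders/storers always receive absolute locations.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathQualifySketch {
    public static String toAbsolute(String commaSeparated, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        StringBuilder out = new StringBuilder();
        for (String p : commaSeparated.split(",")) {
            if (out.length() > 0) out.append(',');
            out.append(new Path(p).makeQualified(fs).toString());
        }
        return out.toString();
    }
}
{code}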



 pig command line -M option doesn't support table union correctly (comma 
 separated paths)
 

 Key: PIG-1158
 URL: https://issues.apache.org/jira/browse/PIG-1158
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0


 for example, load (1.txt,2.txt) USING 
 org.apache.hadoop.zebra.pig.TableLoader()
 I see this error on standard out:
 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/1.txt,2.txt does not 
 exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792593#action_12792593
 ] 

Olga Natkovich commented on PIG-1157:
-

+1. Patch looks good. Will commit once the tests pass.

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1162) Pig 0.6.0 - UDF doc

2009-12-18 Thread Corinne Chandel (JIRA)
Pig 0.6.0 - UDF doc
---

 Key: PIG-1162
 URL: https://issues.apache.org/jira/browse/PIG-1162
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
 Fix For: 0.6.0


Pig 0.6.0 - UDF doc

Small corrections.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1162) Pig 0.6.0 - UDF doc

2009-12-18 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1162:
-

Attachment: pig-6-udf.patch

Patch file for UDF doc.

 Pig 0.6.0 - UDF doc
 ---

 Key: PIG-1162
 URL: https://issues.apache.org/jira/browse/PIG-1162
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
 Fix For: 0.6.0

 Attachments: pig-6-udf.patch


 Pig 0.6.0 - UDF doc
 Small corrections.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1162) Pig 0.6.0 - UDF doc

2009-12-18 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1162:
-

Status: Patch Available  (was: Open)

(1) apply this patch to Pig TRUNK

(2) apply this patch to Pig branch-0.6

(3) Note: No new test code required; changes to documentation only.

 Pig 0.6.0 - UDF doc
 ---

 Key: PIG-1162
 URL: https://issues.apache.org/jira/browse/PIG-1162
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
 Fix For: 0.6.0

 Attachments: pig-6-udf.patch


 Pig 0.6.0 - UDF doc
 Small corrections.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-18 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792612#action_12792612
 ] 

Olga Natkovich commented on PIG-1102:
-

I will be reviewing this patch

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to the disk is useful for understanding query performance and also for 
 seeing how certain changes in Pig affect that.
 Other interesting stats to collect would be average CPU usage and max mem 
 usage but I am not sure if this information is easily retrievable.
 Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792627#action_12792627
 ] 

Thejas M Nair commented on PIG-1149:


I notice that this patch is using org.mortbay.log instead of 
org.apache.commons.logging. org.mortbay.log is not used anywhere else in Pig code. Should 
we replace it with org.apache.commons.logging?

A small change is required to get the patch working with the load-store branch. It 
no longer requires the load func to implement the SamplableLoader interface, as that 
interface has been removed. I can submit the modified patch. 
{code}
+loader = (SamplableLoader)PigContext.instantiateFuncFromSpec(funcSpec);
{code}
changes to 
{code}
+loader = (LoadFunc)PigContext.instantiateFuncFromSpec(funcSpec);
{code}


 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792633#action_12792633
 ] 

Hadoop QA commented on PIG-1157:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428448/PIG-1157.patch
  against trunk revision 892125.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/141/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/141/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/141/console

This message is automatically generated.

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-

[jira] Updated: (PIG-1158) pig command line -M option doesn't support table union correctly (comma separated paths)

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1158:
--

Attachment: PIG-1158.patch

 pig command line -M option doesn't support table union correctly (comma 
 separated paths)
 

 Key: PIG-1158
 URL: https://issues.apache.org/jira/browse/PIG-1158
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1158.patch


 for example, load (1.txt,2.txt) USING 
 org.apache.hadoop.zebra.pig.TableLoader()
 I see this error on standard out:
 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/1.txt,2.txt does not 
 exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1163) Pig/Zebra 0.6.0 release - Doc Updates

2009-12-18 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1163:
-

Attachment: zebra-6-update-1.patch

First update patch for Zebra 0.6.0 release

 Pig/Zebra 0.6.0 release - Doc Updates
 -

 Key: PIG-1163
 URL: https://issues.apache.org/jira/browse/PIG-1163
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra-6-update-1.patch


 Pig/Zebra 0.6.0 release - Doc Updates
 Updates for the Zebra 0.6.0 docs.
 (1) First patch - please apply the first patch now (zebra-6-update-1.patch)
 (2) Second patch - depending on feedback, we may have a second patch to apply 
 Jan 4 or Jan 5
 Thanks/C

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1163) Pig/Zebra 0.6.0 release - Doc Updates

2009-12-18 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-1163:
-

Status: Patch Available  (was: Open)

(1) Apply this patch to Pig TRUNK

(2) Apply this patch to Pig branch-0.6

(3) Note: No new test code required; changes to documentation only.

 Pig/Zebra 0.6.0 release - Doc Updates
 -

 Key: PIG-1163
 URL: https://issues.apache.org/jira/browse/PIG-1163
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra-6-update-1.patch


 Pig/Zebra 0.6.0 release - Doc Updates
 Updates for the Zebra 0.6.0 docs.
 (1) First patch - please apply the first patch now (zebra-6-update-1.patch)
 (2) Second patch - depending on feedback, we may have a second patch to apply 
 Jan 4 or Jan 5
 Thanks/C

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1157) Successive replicated joins do not generate a Map Reduce plan and fail due to OOM

2009-12-18 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1157:


   Resolution: Fixed
Fix Version/s: (was: 0.6.0)
   0.7.0
   Status: Resolved  (was: Patch Available)

patch committed, thanks Richard

 Successive replicated joins do not generate a Map Reduce plan and fail due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, PIG-1157.patch, 
 replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 If we want to look at the explain output, we find that there is no Map Reduce 
 plan that is generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-18 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792680#action_12792680
 ] 

Sriranjan Manjunath commented on PIG-1102:
--

I ran the test again on my local machine, and it passes. The test failed 
because of too many open file descriptors. Is this a hudson related issue?

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to the disk is useful for understanding query performance and also for 
 seeing how certain changes in Pig affect that.
 Other interesting stats to collect would be average CPU usage and max mem 
 usage but I am not sure if this information is easily retrievable.
 Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1141) Make streaming work with the new load-store interfaces

2009-12-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792688#action_12792688
 ] 

Alan Gates commented on PIG-1141:
-

In DefaultInputHandler.close, why was the code that flushes and closes stdin 
removed?  Same question for DefaultOutputHandler and stdout.  It seems like we 
still need to flush and close these streams properly.

Similar to the above, close was removed from FileOutputHandler (but not 
FileInputHandler).

Both the PigToStream and StreamToPig interfaces should have some javadoc comments 
explaining what they do and why.

In StorageUtil.parseFieldDel, you call Integer.valueOf(String) for both \u and 
\x.  For \x you should instead use Integer.valueOf(String, 16).
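The distinction, in a standalone sketch (this is not the real StorageUtil code, just the parsing difference being pointed out):

{code}
// Sketch of the radix point: hex escapes need base 16, decimal escapes base 10.
public class FieldDelSketch {
    public static byte parseFieldDel(String spec) {
        if (spec.startsWith("\\x")) {
            // "\\x01" -> parse "01" as hex.
            return Integer.valueOf(spec.substring(2), 16).byteValue();
        } else if (spec.startsWith("\\u")) {
            // Per the comment above, the \\u form is parsed as a decimal number.
            return Integer.valueOf(spec.substring(2)).byteValue();
        }
        // A literal single-character delimiter such as ":".
        return (byte) spec.charAt(0);
    }
}
{code}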


 Make streaming work with the new load-store interfaces 
 ---

 Key: PIG-1141
 URL: https://issues.apache.org/jira/browse/PIG-1141
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1141.patch, PIG-1141.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792689#action_12792689
 ] 

Thejas M Nair commented on PIG-1149:


The first test case failure is known; I will be fixing that with a patch in 
PIG-1094. 

The special string gets added to the last row only. But that looks unnecessary. 
I will be removing that with a new patch in PIG-1062. 

You can submit your patch for the LSR branch by checking for 5 columns in 
your test case. I will change your new test case as well when I submit the new 
PIG-1062 patch (to check for 4 columns).


 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1156) Add aliases to ExecJobs and PhysicalOperators

2009-12-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1156:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed.  Thanks Dmitriy.

 Add aliases to ExecJobs and PhysicalOperators
 -

 Key: PIG-1156
 URL: https://issues.apache.org/jira/browse/PIG-1156
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.7.0

 Attachments: pig_batchAliases.patch


 Currently, the way to use multi-query from Java is as follows:
 1.  pigServer.setBatchOn();
 2. register your queries with pigServer
 3. List<ExecJob> jobs = pigServer.executeBatch();
 4. for (ExecJob job : jobs) { Iterator<Tuple> results = job.getResults(); }
 This will cause all stores to get evaluated in a single batch. However, there 
 is no way to identify which of the ExecJobs corresponds to which store.  We 
 should add aliases by which the stored relations are known to ExecJob in 
 order to allow the user to identify what the jobs correspond to.
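Putting the four steps together, a minimal sketch of the batch usage (queries and paths are placeholders):

{code}
// Sketch of the multi-query batch API described above; the relations,
// input, and output paths are placeholders.
import java.util.Iterator;
import java.util.List;
import org.apache.pig.PigServer;
import org.apache.pig.backend.executionengine.ExecJob;
import org.apache.pig.data.Tuple;

public class BatchSketch {
    public static void runBatch(PigServer pigServer) throws Exception {
        pigServer.setBatchOn();
        pigServer.registerQuery("a = LOAD 'in' AS (x:int, y:chararray);");
        pigServer.registerQuery("STORE a INTO 'out';");
        List<ExecJob> jobs = pigServer.executeBatch();
        for (ExecJob job : jobs) {
            // Without aliases on ExecJob there is no way to tell which store
            // these results came from; that is the gap this issue addresses.
            Iterator<Tuple> results = job.getResults();
            while (results.hasNext()) {
                System.out.println(results.next());
            }
        }
    }
}
{code}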

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792698#action_12792698
 ] 

Thejas M Nair commented on PIG-1149:


I spoke too soon about the special string being unnecessary. GetMemNumRows uses 
it. I will add some comments to document that in PoissonSampleLoader.
In the previous comment, "special string gets added to the last row only" should 
be "special string gets added to the last *sample* row only".


 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1159) merge join right side table does not support comma separated paths

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1159:
--

Attachment: PIG-1159.patch

With this patch, Pig runtime no longer passes an InputStream to IndexableLoader 
through the bindTo method. An IndexableLoader is responsible for creating its own 
InputStream for reading data. 

This actually isn't a new requirement:  currently all existing IndexableLoaders 
create their own InputStreams. And, in the future, with the load-store 
redesign, Pig runtime will no longer create InputStreams for the loaders.
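As an illustration of the direction described (the class and method below are hypothetical, not the interface change in the patch), a loader that owns its stream would open it itself from the location it is given:

{code}
// Hypothetical loader fragment: open our own InputStream for a location
// rather than receiving one through bindTo.
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SelfOpeningLoaderSketch {
    private InputStream in;

    public void open(String location, Configuration conf) throws IOException {
        Path p = new Path(location);
        FileSystem fs = p.getFileSystem(conf);
        in = fs.open(p);   // the loader creates and manages this stream itself
    }
}
{code}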

 merge join right side table does not support comma separated paths
 --

 Key: PIG-1159
 URL: https://issues.apache.org/jira/browse/PIG-1159
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1159.patch


 For example this is my script:(join_jira1.pig)
 register /grid/0/dev/hadoopqa/jars/zebra.jar;
 --a1 = load '1.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 --a2 = load '2.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 --sort1 = order a1 by a parallel 6;
 --sort2 = order a2 by a parallel 5;
 --store sort1 into 'asort1' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort2 into 'asort2' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort1 into 'asort3' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort2 into 'asort4' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 joinl = LOAD 'asort1,asort2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
 joinr = LOAD 'asort3,asort4' USING 
 org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
 joina = join joinl by a, joinr by a using merge ;
 dump joina;
 ==
 here is the log:
 Backend error message
 -
 java.lang.IllegalArgumentException: Pathname 
 /user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4
  from 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4
  is not a valid DFS filename.
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:147)
 at 
 org.apache.pig.impl.io.FileLocalizer.fullPath(FileLocalizer.java:534)
 at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:338)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:398)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
 Pig Stack Trace
 ---
 ERROR 6015: During execution, encountered a Hadoop error.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias joina
 at org.apache.pig.PigServer.openIterator(PigServer.java:482)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:386)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: 
 During 

[jira] Updated: (PIG-1159) merge join right side table does not support comma separated paths

2009-12-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1159:
--

Status: Patch Available  (was: Open)

 merge join right side table does not support comma separated paths
 --

 Key: PIG-1159
 URL: https://issues.apache.org/jira/browse/PIG-1159
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1159.patch


 For example this is my script:(join_jira1.pig)
 register /grid/0/dev/hadoopqa/jars/zebra.jar;
 --a1 = load '1.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 --a2 = load '2.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 --sort1 = order a1 by a parallel 6;
 --sort2 = order a2 by a parallel 5;
 --store sort1 into 'asort1' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort2 into 'asort2' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort1 into 'asort3' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort2 into 'asort4' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 joinl = LOAD 'asort1,asort2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
 joinr = LOAD 'asort3,asort4' USING 
 org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
 joina = join joinl by a, joinr by a using merge ;
 dump joina;
 ==
 here is the log:
 Backend error message
 -
 java.lang.IllegalArgumentException: Pathname 
 /user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4
  from 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4
  is not a valid DFS filename.
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:147)
 at 
 org.apache.pig.impl.io.FileLocalizer.fullPath(FileLocalizer.java:534)
 at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:338)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:398)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
 Pig Stack Trace
 ---
 ERROR 6015: During execution, encountered a Hadoop error.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias joina
 at org.apache.pig.PigServer.openIterator(PigServer.java:482)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:386)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: 
 During execution, encountered a Hadoop error.
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)

[jira] Commented: (PIG-1141) Make streaming work with the new load-store interfaces

2009-12-18 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792720#action_12792720
 ] 

Richard Ding commented on PIG-1141:
---

bq. In DefaultInputHandler.close, why was the code that flushes and closes 
stdin removed? Same question for DefaultOutputHandler and stdout. It seems like 
we still need to flush and close these streams properly.

Because there is no 'stdin' or 'stdout' to flush and close :)

bq. Similar to the above, close was removed from FileOutputHandler (but not 
FileInputHandler).

I want to do the same for FileInputHandler, but findbugs doesn't allow it :(

bq. Both PigToStream and StreamToPig interfaces should have some javadoc 
comments for the interface explaining what they do and why.

I'll add javadoc for the interfaces.

bq. In StorageUtil.parseFieldDel, you call Integer.valueOf(String) for both \u 
and \x. For \x you should instead use Integer.valueOf(String, 16).

This is copied (refactored) from the current PigStorage code; do we want to 
change it?
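
For reference, a small sketch of the radix point above (a hypothetical helper, 
not the actual StorageUtil.parseFieldDel code): the digits after \x are 
hexadecimal, so they need the radix-16 overload, while \u keeps the default 
parse as in the current PigStorage code.

{code}
// Hypothetical helper illustrating the radix distinction; not the actual
// StorageUtil.parseFieldDel implementation.
public class DelimiterParse {
    public static byte parseFieldDel(String spec) {
        if (spec.startsWith("\\u")) {
            // default radix, matching the current PigStorage behavior discussed above
            return (byte) Integer.valueOf(spec.substring(2)).intValue();
        } else if (spec.startsWith("\\x")) {
            // hex digits, e.g. "\\x2c" -> 44 (','), so radix 16 is required
            return (byte) Integer.valueOf(spec.substring(2), 16).intValue();
        } else {
            // a single literal character, e.g. ":"
            return (byte) spec.charAt(0);
        }
    }
}
{code}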

 Make streaming work with the new load-store interfaces 
 ---

 Key: PIG-1141
 URL: https://issues.apache.org/jira/browse/PIG-1141
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1141.patch, PIG-1141.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1163) Pig/Zebra 0.6.0 release - Doc Updates

2009-12-18 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1163:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to both trunk and the 0.6.0 branch. Thanks, Corinne.

 Pig/Zebra 0.6.0 release - Doc Updates
 -

 Key: PIG-1163
 URL: https://issues.apache.org/jira/browse/PIG-1163
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra-6-update-1.patch


 Pig/Zebra 0.6.0 release - Doc Updates
 Updates for the Zebra 0.6.0 docs.
 (1) First patch - please apply the first patch now (zebra-6-update-1.patch)
 (2) Second patch - depending on feedback, we may have a second patch to apply 
 Jan 4 or Jan 5
 Thanks/C

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1164) [zebra]smoke test

2009-12-18 Thread Jing Huang (JIRA)
[zebra]smoke test
-

 Key: PIG-1164
 URL: https://issues.apache.org/jira/browse/PIG-1164
 Project: Pig
  Issue Type: Test
Affects Versions: 0.6.0
Reporter: Jing Huang
 Fix For: 0.7.0


Change the zebra build.xml file to add a smoke target, and add env.sh and a run 
script under the zebra/src/test/smoke dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Commented: (PIG-1162) Pig 0.6.0 - UDF doc

2009-12-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792738#action_12792738
 ] 

Hadoop QA commented on PIG-1162:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428464/pig-6-udf.patch
  against trunk revision 892125.

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/142/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/142/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/142/console

This message is automatically generated.

 Pig 0.6.0 - UDF doc
 ---

 Key: PIG-1162
 URL: https://issues.apache.org/jira/browse/PIG-1162
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
 Fix For: 0.6.0

 Attachments: pig-6-udf.patch


 Pig 0.6.0 - UDF doc
 Small corrections.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1146) Inconsistent column pruning in LOUnion

2009-12-18 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reassigned PIG-1146:
---

Assignee: Daniel Dai

 Inconsistent column pruning in LOUnion
 --

 Key: PIG-1146
 URL: https://issues.apache.org/jira/browse/PIG-1146
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1146-1.patch


 This happens when we do a union on two relations where one column comes from a 
 loader, the matching column in the other relation comes from a constant, and this 
 column gets pruned. We prune the column that comes from the loader but do not 
 prune the constant, which leaves the union in an inconsistent state. Here is a script:
 {code}
 a = load '1.txt' as (a0, a1:chararray, a2);
 b = load '2.txt' as (b0, b2);
 c = foreach b generate b0, 'hello', b2;
 d = union a, c;
 e = foreach d generate $0, $2;
 dump e;
 {code}
 1.txt: 
 {code}
 ulysses thompson    64  1.90
 katie carson    25  3.65
 {code}
 2.txt:
 {code}
 luke king   0.73
 holly davidson  2.43
 {code}
 expected output:
 (ulysses thompson,1.90)
 (katie carson,3.65)
 (luke king,0.73)
 (holly davidson,2.43)
 real output:
 (ulysses thompson,)
 (katie carson,)
 (luke king,0.73)
 (holly davidson,2.43)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1146) Inconsistent column pruning in LOUnion

2009-12-18 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1146:


Attachment: PIG-1146-1.patch

 Inconsistent column pruning in LOUnion
 --

 Key: PIG-1146
 URL: https://issues.apache.org/jira/browse/PIG-1146
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1146-1.patch


 This happens when we do a union on two relations where one column comes from a 
 loader, the matching column in the other relation comes from a constant, and this 
 column gets pruned. We prune the column that comes from the loader but do not 
 prune the constant, which leaves the union in an inconsistent state. Here is a script:
 {code}
 a = load '1.txt' as (a0, a1:chararray, a2);
 b = load '2.txt' as (b0, b2);
 c = foreach b generate b0, 'hello', b2;
 d = union a, c;
 e = foreach d generate $0, $2;
 dump e;
 {code}
 1.txt: 
 {code}
 ulysses thompson    64  1.90
 katie carson    25  3.65
 {code}
 2.txt:
 {code}
 luke king   0.73
 holly davidson  2.43
 {code}
 expected output:
 (ulysses thompson,1.90)
 (katie carson,3.65)
 (luke king,0.73)
 (holly davidson,2.43)
 real output:
 (ulysses thompson,)
 (katie carson,)
 (luke king,0.73)
 (holly davidson,2.43)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1164) [zebra]smoke test

2009-12-18 Thread Jing Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Huang updated PIG-1164:


Attachment: smoke.patch

Patch for the zebra smoke test. 
No unit test needed for this patch. 
Only changed build.xml to add the smoke target and added an environment setup file. 

 [zebra]smoke test
 -

 Key: PIG-1164
 URL: https://issues.apache.org/jira/browse/PIG-1164
 Project: Pig
  Issue Type: Test
Affects Versions: 0.6.0
Reporter: Jing Huang
 Fix For: 0.7.0

 Attachments: smoke.patch


 Change the zebra build.xml file to add a smoke target, and add env.sh and a run 
 script under the zebra/src/test/smoke dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1146) Inconsistent column pruning in LOUnion

2009-12-18 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1146:


Status: Patch Available  (was: Open)

 Inconsistent column pruning in LOUnion
 --

 Key: PIG-1146
 URL: https://issues.apache.org/jira/browse/PIG-1146
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1146-1.patch


 This happens when we do a union on two relations where one column comes from a 
 loader, the matching column in the other relation comes from a constant, and this 
 column gets pruned. We prune the column that comes from the loader but do not 
 prune the constant, which leaves the union in an inconsistent state. Here is a script:
 {code}
 a = load '1.txt' as (a0, a1:chararray, a2);
 b = load '2.txt' as (b0, b2);
 c = foreach b generate b0, 'hello', b2;
 d = union a, c;
 e = foreach d generate $0, $2;
 dump e;
 {code}
 1.txt: 
 {code}
 ulysses thompson    64  1.90
 katie carson    25  3.65
 {code}
 2.txt:
 {code}
 luke king   0.73
 holly davidson  2.43
 {code}
 expected output:
 (ulysses thompson,1.90)
 (katie carson,3.65)
 (luke king,0.73)
 (holly davidson,2.43)
 real output:
 (ulysses thompson,)
 (katie carson,)
 (luke king,0.73)
 (holly davidson,2.43)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1153) [zebra] splitting columns at different levels in a complex record column into different column groups throws exception

2009-12-18 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou reassigned PIG-1153:
-

Assignee: Yan Zhou

 [zebra] splitting columns at different levels in a complex record column into 
 different column groups throws exception
 -

 Key: PIG-1153
 URL: https://issues.apache.org/jira/browse/PIG-1153
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
Assignee: Yan Zhou
 Attachments: PIG-1153.patch


 The following code sample:
   String strSch = "r1:record(f1:int, f2:int), r2:record(f5:int, r3:record(f3:float, f4))";
   String strStorage = "[r1.f1, r2.r3.f3, r2.f5]; [r1.f2, r2.r3.f4]";
   Partition p = new Partition(schema.toString(), strStorage, null);
 gives the following exception:
 org.apache.hadoop.zebra.parser.ParseException: Different Split Types Set 
 on the same field: r2.f5

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1153) [zebra] splitting columns at different levels in a complex record column into different column groups throws exception

2009-12-18 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1153:
--

Attachment: PIG-1153.patch

 [zebra] splitting columns at different levels in a complex record column into 
 different column groups throws exception
 -

 Key: PIG-1153
 URL: https://issues.apache.org/jira/browse/PIG-1153
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
Assignee: Yan Zhou
 Attachments: PIG-1153.patch


 The following code sample:
   String strSch = "r1:record(f1:int, f2:int), r2:record(f5:int, r3:record(f3:float, f4))";
   String strStorage = "[r1.f1, r2.r3.f3, r2.f5]; [r1.f2, r2.r3.f4]";
   Partition p = new Partition(schema.toString(), strStorage, null);
 gives the following exception:
 org.apache.hadoop.zebra.parser.ParseException: Different Split Types Set 
 on the same field: r2.f5

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1153) [zebra] splitting columns at different levels in a complex record column into different column groups throws exception

2009-12-18 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1153:
--

Status: Patch Available  (was: Open)

 [zebra] splitting columns at different levels in a complex record column into 
 different column groups throws exception
 -

 Key: PIG-1153
 URL: https://issues.apache.org/jira/browse/PIG-1153
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
Assignee: Yan Zhou
 Attachments: PIG-1153.patch


 The following code sample:
   String strSch = "r1:record(f1:int, f2:int), r2:record(f5:int, r3:record(f3:float, f4))";
   String strStorage = "[r1.f1, r2.r3.f3, r2.f5]; [r1.f2, r2.r3.f4]";
   Partition p = new Partition(schema.toString(), strStorage, null);
 gives the following exception:
 org.apache.hadoop.zebra.parser.ParseException: Different Split Types Set 
 on the same field: r2.f5

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1165) Signature of loader is not set correctly for order by

2009-12-18 Thread Daniel Dai (JIRA)
Signature of loader is not set correctly for order by
---

 Key: PIG-1165
 URL: https://issues.apache.org/jira/browse/PIG-1165
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0


In Pig, we need to set a signature for each LoadFunc. Currently, we use the alias of 
the LOAD statement in the Pig script as the signature of the LoadFunc. One use case 
is that, inside a LoadFunc, we use the signature to retrieve the pruned columns for 
that specific loader. However, for an order by statement, we do not set the signature 
of the loader correctly, so we do not prune that loader correctly. 
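
As a toy illustration of the mechanism (a simplified stand-in, not the actual Pig 
UDFContext plumbing): per-loader state such as the pruned-column list is keyed by 
the signature, so a loader whose signature is missing or wrong silently sees no 
pruning information.

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for signature-keyed loader state.
public class SignatureDemo {
    private static final Map<String, List<Integer>> prunedColumnsBySignature =
        new HashMap<String, List<Integer>>();

    static class DemoLoader {
        private final String signature; // normally set by Pig, e.g. from the LOAD alias

        DemoLoader(String signature) {
            this.signature = signature;
        }

        List<Integer> columnsToRead() {
            // A missing or wrong signature means no pruning info is found.
            return prunedColumnsBySignature.get(signature);
        }
    }

    public static void main(String[] args) {
        prunedColumnsBySignature.put("a", Arrays.asList(1)); // prune to column a1
        System.out.println(new DemoLoader("a").columnsToRead());     // [1]
        System.out.println(new DemoLoader("unset").columnsToRead()); // null
    }
}
{code}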

For example, the following script produces a wrong result:

{code}
a = load '1.txt' as (a0, a1);
b = order a by a1;
c = order b by a1;
d = foreach c generate a1;
dump d;
{code}

1.txt:
{code}
1   a
2   b
3   c
6   d
5   e
{code}

expected result:
a
b
c
d
e

current result:
1
2
3
5
6

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1162) Pig 0.6.0 - UDF doc

2009-12-18 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1162:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to both trunk and the 0.6 branch. Thanks, Corinne!

 Pig 0.6.0 - UDF doc
 ---

 Key: PIG-1162
 URL: https://issues.apache.org/jira/browse/PIG-1162
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
 Fix For: 0.6.0

 Attachments: pig-6-udf.patch


 Pig 0.6.0 - UDF doc
 Small corrections.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1130) In pig local ( hadoop local mode ) mode the counting of number of tuples and bytes is incorrect if data is more than one local split.

2009-12-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792770#action_12792770
 ] 

Jeff Zhang commented on PIG-1130:
-

Alan, I think one approach is to check the type of the FileSystem: if it is 
LocalFileSystem while we are in MapReduce mode, then we should throw an exception. 
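
Roughly, such a check could look like the sketch below, using the standard Hadoop 
FileSystem API; where it would be hooked into Pig's MapReduce launch path is left open:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class FsCheck {
    // Throw if we are about to run a MapReduce job against the local file system.
    static void checkNotLocal(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof LocalFileSystem) {
            throw new IOException(
                "Input is on the local file system but the job is running in MapReduce mode");
        }
    }
}
{code}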

 In pig local ( hadoop local mode ) mode the counting of number of tuples and 
 bytes is incorrect if data is more than one local split.
 -

 Key: PIG-1130
 URL: https://issues.apache.org/jira/browse/PIG-1130
 Project: Pig
  Issue Type: Bug
Reporter: Ankit Modi
Priority: Minor

 If the output generates more than one part file, the current code only gives 
 stats for the first part file, i.e., part-0.
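
A sketch of summing output sizes over every part file rather than only the first 
one; the helper name and the part-file glob are illustrative assumptions, not the 
PIG-1130 patch:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputStats {
    // Sum the sizes of all part files under an output directory.
    static long totalOutputBytes(Path outputDir, Configuration conf) throws IOException {
        FileSystem fs = outputDir.getFileSystem(conf);
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-*"));
        long total = 0;
        if (parts != null) {
            for (FileStatus part : parts) {
                total += part.getLen();
            }
        }
        return total;
    }
}
{code}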

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-18 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1149:
---

Attachment: pig_1149_lsr-branch.patch

Attaching patch for lsr branch.
I also retabbed the involved files to replace tabs with spaces, and got rid of 
some unused imports.

Note the FIXME in the test case, as discussed.

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch, pig_1149_lsr-branch.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1158) pig command line -M option doesn't support table union correctly (comma separated paths)

2009-12-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792786#action_12792786
 ] 

Hadoop QA commented on PIG-1158:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428482/PIG-1158.patch
  against trunk revision 892408.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/143/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/143/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/143/console

This message is automatically generated.

 pig command line -M option doesn't support table union correctly (comma 
 separated paths)
 

 Key: PIG-1158
 URL: https://issues.apache.org/jira/browse/PIG-1158
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1158.patch


 For example, load (1.txt,2.txt) USING 
 org.apache.hadoop.zebra.pig.TableLoader().
 I see this error on stdout:
 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/1.txt,2.txt does not 
 exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1163) Pig/Zebra 0.6.0 release - Doc Updates

2009-12-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792787#action_12792787
 ] 

Hadoop QA commented on PIG-1163:


-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12428494/zebra-6-update-1.patch
  against trunk revision 892416.

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/144/console

This message is automatically generated.

 Pig/Zebra 0.6.0 release - Doc Updates
 -

 Key: PIG-1163
 URL: https://issues.apache.org/jira/browse/PIG-1163
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.6.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
 Fix For: 0.6.0

 Attachments: zebra-6-update-1.patch


 Pig/Zebra 0.6.0 release - Doc Updates
 Updates for the Zebra 0.6.0 docs.
 (1) First patch - please apply the first patch now (zebra-6-update-1.patch)
 (2) Second patch - depending on feedback, we may have a second patch to apply 
 Jan 4 or Jan 5
 Thanks/C

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.