[jira] Commented: (PIG-797) Limit with ORDER BY producing wrong results

2009-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722178#action_12722178
 ] 

Hudson commented on PIG-797:


Integrated in Pig-trunk #480 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/480/])
PIG-797: Limit with ORDER BY producing wrong results


 Limit with ORDER BY producing wrong results
 ---

 Key: PIG-797
 URL: https://issues.apache.org/jira/browse/PIG-797
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-797-2.patch, PIG-797-3.patch, PIG-797.patch


 Query:
 A = load 'studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, SUM(A.gpa) as rev;
 D = order C by rev;
 E = limit D 10;
 dump E;
 Output:
 (alice king,31.7)
 (alice laertes,26.453)
 (alice thompson,25.867)
 (alice van buren,23.59)
 (bob allen,19.902)
 (bob ichabod,29.0)
 (bob king,28.454)
 (bob miller,10.28)
 (bob underhill,28.137)
 (bob van buren,25.992)
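The listing above is ordered by name rather than by rev, which is the symptom: the limit was not taken from the sorted relation. A toy Python sketch contrasting the expected and observed orderings (hypothetical data, not the actual studenttab10k rows):

```python
# Toy illustration of the PIG-797 symptom: LIMIT must be applied to the
# globally sorted relation, not to rows in their original (name) order.
rows = [("bob miller", 10.28), ("alice king", 31.7),
        ("bob allen", 19.902), ("alice laertes", 26.453)]

# Correct semantics of: D = order C by rev; E = limit D 2;
correct = sorted(rows, key=lambda r: r[1])[:2]

# The buggy output above instead looks like rows ordered by name.
buggy = sorted(rows, key=lambda r: r[0])[:2]

print(correct)  # the two smallest rev values, ascending
print(buggy)    # alphabetical by name: not what ORDER BY rev asked for
```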

-- 
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-06-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722177#action_12722177
 ] 

Hudson commented on PIG-697:


Integrated in Pig-trunk #480 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/480/])
PIG-697: Proposed improvements to pig's optimizer (sms)


 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
 OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch


 I propose the following changes to Pig's optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that they 
 only match exact patterns instead of allowing missing elements in the 
 pattern.  This has the downside that if a given rule applies to two patterns 
 (say Load->Filter->Group and Load->Group) you have to write two rules, but 
 it has the upside that the resulting rules know exactly what they are 
 getting.  The original intent of the required array was to reduce the number 
 of rules that needed to be written, but the resulting rules have to do a lot 
 of work to understand the operators they are working with.  With exact 
 matches only, each rule will know exactly which operators it is working on 
 and can apply the logic of shifting the operators around.  All four of the 
 existing rules set all entries of required to true, so removing it will have 
 no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 // It matches the pattern.  Now check if the transformer
                 // approves as well.
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     sawMatch = true;
                     rule.transformer.transform(match);
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 multiple times as necessary.  This allows us to write simple rules, mostly 
 swaps between neighboring operators, without worrying about getting the plan 
 right in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial plan to the final plan without needing to understand the big picture 
 of the entire plan.
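 As a rough illustration of the iteration idea (not Pig code: the plan is a plain list of operator names, the rules are hypothetical, and a third Join/Foreach swap rule is added so this toy rewrite converges), the fixed-point loop could look like:

```python
# Hedged sketch: apply simple neighbor-swap rules repeatedly until no rule
# fires or an iteration cap is hit, mirroring the proposed do/while loop.
MAX_ITERATIONS = 1000

def swap_rule(first, second):
    """Rule: whenever `first` immediately precedes `second`, swap them."""
    def apply(plan):
        for i in range(len(plan) - 1):
            if plan[i] == first and plan[i + 1] == second:
                plan[i], plan[i + 1] = plan[i + 1], plan[i]
                return True
        return False
    return apply

def optimize(plan, rules):
    iterations = 0
    saw_match = True
    while saw_match and iterations < MAX_ITERATIONS:
        # any() stops at the first rule that fires; the loop retries all
        # rules on the next pass, so each rule can apply many times.
        saw_match = any(rule(plan) for rule in rules)
        iterations += 1
    return plan

rules = [swap_rule("Join", "Filter"), swap_rule("Filter", "Foreach"),
         swap_rule("Join", "Foreach")]
print(optimize(["Load", "Join", "Filter", "Foreach"], rules))
# ['Load', 'Foreach', 'Filter', 'Join']
```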
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
     ...
 }

 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first 

[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-20 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-851:
---

Attachment: (was: Pig_815_patch.txt)

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Attachments: patch_815.txt, patch_815.txt


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 does not figure out the data type.  As a result, the type is set to unknown, 
 resulting in a run-time failure.  An example script and UDF follow:
 {code}
 public class mapUDF extends EvalFunc<Map<Object, Object>> {
     @Override
     public Map<Object, Object> exec(Tuple input) throws IOException {
         return new HashMap<Object, Object>();
     }

     // Note that the outputSchema method is commented out
     /*
     @Override
     public Schema outputSchema(Schema input) {
         try {
             return new Schema(new Schema.FieldSchema(null, null,
                 DataType.MAP));
         } catch (FrontendException e) {
             return null;
         }
     }
     */
 }
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2080: Foreach currently does not handle type Unknown
 {code}




[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820.patch

In addition to the explanation above, a SampleOptimizer is introduced which 
visits the compiled MR plan to detect this pattern (an MR operator containing 
only a load and store, followed by an MR operator containing a sampling job in 
its map plan). If the pattern is present, the SampleOptimizer deletes the 
unnecessary predecessor MR operator and replaces the POLoad of the sampling 
job with a RandomSampleLoader that uses its predecessor's loader.
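A minimal sketch of that pattern match, assuming a toy list-of-dicts MR plan in dependency order (all names here are hypothetical illustrations, not the actual SampleOptimizer code):

```python
# Hedged sketch: find a load/store-only MR operator immediately preceding a
# sampling MR operator, splice it out, and reuse its loader for the sample.
def optimize_sample(mr_plan):
    """mr_plan: list of dicts, each describing one MR operator."""
    for i in range(1, len(mr_plan)):
        pred, op = mr_plan[i - 1], mr_plan[i]
        if pred.get("ops") == ["load", "store"] and op.get("sampling"):
            # The sampling job reads with the predecessor's loader directly,
            # so the predecessor job is no longer needed.
            op["loader"] = pred["loader"]
            del mr_plan[i - 1]
            return True
    return False
```

Here the `ops`, `sampling`, and `loader` keys stand in for inspecting the real map/reduce plans inside each MR operator.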

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch


 Currently a sampling job requires that data already be stored in 
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
 order by this has mostly been acceptable, because users tend to use order by 
 at the end of their script, where other MR jobs have already operated on the 
 data and it is therefore already stored in BinaryStorage.  But for Pig 
 scripts that just do an order by, an entire MR job is required to read the 
 data and write it back out in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.  Join is often the first operation of a script, and thus is much 
 more likely to trigger this useless up-front translation job.
 Instead, RandomSampleLoader can be changed to subsume an existing loader, 
 using the user-specified loader to read the tuples while handling the 
 skipping between tuples itself.  This will require the subsumed loader to 
 implement a Samplable interface that will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {

     /**
      * Skip ahead in the input stream.
      * @param n number of bytes to skip
      * @return number of bytes actually skipped.  The return semantics are
      * exactly the same as {@link java.io.InputStream#skip(long)}
      */
     public long skip(long n) throws IOException;

     /**
      * Get the current position in the stream.
      * @return position in the stream.
      */
     public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check whether the loader being used to load the 
 data implements the SamplableLoader interface.  If so, rather than create an 
 initial MR job to do the translation, it would create the sampling job 
 directly, having RandomSampleLoader use the user-specified loader.
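 As a non-authoritative sketch of how the two calls could be used together, here is a toy Python analogue in which a sampling loader delegates reads to a wrapped "samplable" loader and uses skip() to jump between samples (all names are hypothetical stand-ins for the Java interface above):

```python
# Hedged sketch: a byte-stream loader exposing skip/get_position, and a
# sampling wrapper that reads a tuple, skips ahead, and resynchronizes.
import io

class SamplableTextLoader:
    """Line-oriented loader over a byte stream (toy SamplableLoader)."""
    def __init__(self, stream):
        self.stream = stream

    def skip(self, n):
        before = self.stream.tell()
        self.stream.seek(n, io.SEEK_CUR)
        return self.stream.tell() - before  # bytes actually skipped

    def get_position(self):
        return self.stream.tell()

    def next_tuple(self):
        line = self.stream.readline()
        return line.decode() if line else None

class RandomSampleLoader:
    """Delegates reading to the wrapped loader instead of requiring
    BinaryStorage; handles the skipping between tuples itself."""
    def __init__(self, loader, gap):
        self.loader, self.gap = loader, gap

    def samples(self):
        while True:
            t = self.loader.next_tuple()
            if t is None:
                return
            yield t
            self.loader.skip(self.gap)
            # Discard the (possibly partial) record after the skip to get
            # back to a record boundary before the next sample.
            self.loader.next_tuple()

loader = SamplableTextLoader(io.BytesIO(b"aaaa\nbbbb\ncccc\ndddd\neeee\n"))
print(list(RandomSampleLoader(loader, 3).samples()))
# ['aaaa\n', 'cccc\n', 'eeee\n']
```

 The resynchronization step is the sketch's stand-in for the "handling the skipping between tuples itself" behavior the description asks of RandomSampleLoader.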




[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Affects Version/s: 0.4.0
   Status: Patch Available  (was: Open)

Submitting for both 0.3 and 0.4 branches.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch






Build failed in Hudson: Pig-Patch-minerva.apache.org #96

2009-06-20 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/

--
[...truncated 94729 lines...]
 [exec] [junit] at java.lang.Thread.run(Thread.java:619)
 [exec] [junit] 
 [exec] [junit] 09/06/21 01:44:40 INFO 
mapReduceLayer.JobControlCompiler: Setting up single store job
 [exec] [junit] 09/06/21 01:44:40 WARN mapred.JobClient: Use 
GenericOptionsParser for parsing the arguments. Applications should implement 
Tool for the same.
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_200906210143_0002/job.jar. 
blk_2814448812558836834_1012
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Receiving block 
blk_2814448812558836834_1012 src: /127.0.0.1:43765 dest: /127.0.0.1:56831
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Receiving block 
blk_2814448812558836834_1012 src: /127.0.0.1:56011 dest: /127.0.0.1:37715
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Receiving block 
blk_2814448812558836834_1012 src: /127.0.0.1:46399 dest: /127.0.0.1:59396
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Received block 
blk_2814448812558836834_1012 of size 1430188 from /127.0.0.1
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: PacketResponder 0 
for block blk_2814448812558836834_1012 terminating
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:59396 is added to 
blk_2814448812558836834_1012 size 1430188
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Received block 
blk_2814448812558836834_1012 of size 1430188 from /127.0.0.1
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:37715 is added to 
blk_2814448812558836834_1012 size 1430188
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: PacketResponder 1 
for block blk_2814448812558836834_1012 terminating
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Received block 
blk_2814448812558836834_1012 of size 1430188 from /127.0.0.1
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:56831 is added to 
blk_2814448812558836834_1012 size 1430188
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: PacketResponder 2 
for block blk_2814448812558836834_1012 terminating
 [exec] [junit] 09/06/21 01:44:40 INFO fs.FSNamesystem: Increasing 
replication for file 
/tmp/hadoop-hudson/mapred/system/job_200906210143_0002/job.jar. New replication 
is 2
 [exec] [junit] 09/06/21 01:44:40 INFO fs.FSNamesystem: Reducing 
replication for file 
/tmp/hadoop-hudson/mapred/system/job_200906210143_0002/job.jar. New replication 
is 2
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* ask 
127.0.0.1:37715 to delete  blk_2814448812558836834_1012
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_200906210143_0002/job.split. 
blk_-7415742873646036385_1013
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Receiving block 
blk_-7415742873646036385_1013 src: /127.0.0.1:46400 dest: /127.0.0.1:59396
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Receiving block 
blk_-7415742873646036385_1013 src: /127.0.0.1:48873 dest: /127.0.0.1:59231
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Receiving block 
blk_-7415742873646036385_1013 src: /127.0.0.1:56015 dest: /127.0.0.1:37715
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Received block 
blk_-7415742873646036385_1013 of size 14547 from /127.0.0.1
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: PacketResponder 0 
for block blk_-7415742873646036385_1013 terminating
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Received block 
blk_-7415742873646036385_1013 of size 14547 from /127.0.0.1
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:37715 is added to 
blk_-7415742873646036385_1013 size 14547
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: PacketResponder 1 
for block blk_-7415742873646036385_1013 terminating
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: Received block 
blk_-7415742873646036385_1013 of size 14547 from /127.0.0.1
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:59231 is added to 
blk_-7415742873646036385_1013 size 14547
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.DataNode: PacketResponder 2 
for block blk_-7415742873646036385_1013 terminating
 [exec] [junit] 09/06/21 01:44:40 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap 

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722286#action_12722286
 ] 

Hadoop QA commented on PIG-820:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411325/pig-820.patch
  against trunk revision 786694.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/console

This message is automatically generated.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch


