[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

The latest patch is against the latest code base. It also includes the test 
with the done file. Finally, I was wrong about the log files: it is already 
the case that all errors are logged to the same Pig log file.

 Error reporting for failed MR jobs
 --

 Key: PIG-781
 URL: https://issues.apache.org/jira/browse/PIG-781
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
 Attachments: partial_failure.patch, partial_failure.patch, 
 partial_failure.patch


 If we have multiple MR jobs to run and some of them fail, the system does not 
 stop on the first failure but keeps going; that way jobs that do not depend 
 on the failed job might still succeed.
 The question is how best to report this scenario to the user. How do we tell 
 which jobs failed and which didn't?
 One way could be to tie jobs to stores and report which store locations won't 
 have data and which ones will.
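 A hedged sketch of that last idea (the job type and accessors here are 
 illustrative, not from the patch):
 {code}
 // Remember which store locations each MR job feeds, then report the
 // per-location outcome once all jobs have finished.
 Map<String, Boolean> outcomeByStore = new HashMap<String, Boolean>();
 for (Job job : finishedJobs) {                       // one entry per MR job
     for (String location : job.getStoreLocations())  // hypothetical accessor
         outcomeByStore.put(location, job.isSuccessful());
 }
 for (Map.Entry<String, Boolean> e : outcomeByStore.entrySet()) {
     log.info("store " + e.getKey()
             + (e.getValue() ? ": has data" : ": will have no data"));
 }
 {code}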

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-781) Error reporting for failed MR jobs

2009-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709343#action_12709343
 ] 

Hadoop QA commented on PIG-781:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12408116/partial_failure.patch
  against trunk revision 774582.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 15 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 226 javac compiler warnings (more 
than the trunk's current 225 warnings).

-1 findbugs.  The patch appears to introduce 1 new Findbugs warning.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/39/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/39/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/39/console

This message is automatically generated.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-809) number of input lines it processed, number of output lines it produced for PIG job

2009-05-14 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709440#action_12709440
 ] 

Alan Gates commented on PIG-809:


Is this a duplicate of PIG-619, which was just committed?

 number of input lines it processed, number of output lines it produced for 
 PIG job
 --

 Key: PIG-809
 URL: https://issues.apache.org/jira/browse/PIG-809
 Project: Pig
  Issue Type: Improvement
  Components: impl
 Environment: Linux
Reporter: Supreeth

 Excerpt from the mail conversation.
 It will be a great addition to Pig. Hadoop currently provides all these
 counters. All Pig has to do is to add them up for all Hadoop jobs in the
 script, and emit them at the end of the script. File a jira ?
 - Milind
 On 5/13/09 8:16 AM, Supreeth Hosur Nagesh Rao supre...@yahoo-inc.com
 wrote:
   Hi Olga
   
   With every Pig job, is there any way for us to trap into the operational
   stats of that job, like the number of input lines it processed and the
   number of output lines it produced?
   
   I don't want to have a separate Pig script to do the same, as it may mean
   additional parsing. So is there such a stat? If not, can one be provided
   and exposed as a config parameter?
   
   -Supreeth
 This will be a great feature to have for our processing.
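 As Milind notes, the per-job numbers already exist as Hadoop counters; a 
 hedged sketch of the summing step (assuming the old org.apache.hadoop.mapred 
 API of that era):
 {code}
 import org.apache.hadoop.mapred.Counters;
 import org.apache.hadoop.mapred.RunningJob;
 
 // Inside a method that throws IOException: sum the stock task counters
 // across every MR job the script launched.
 long inputRecords = 0, outputRecords = 0;
 for (RunningJob job : completedJobs) {   // one handle per Hadoop job
     Counters c = job.getCounters();
     Counters.Group g = c.getGroup("org.apache.hadoop.mapred.Task$Counter");
     inputRecords  += g.getCounter("MAP_INPUT_RECORDS");
     outputRecords += g.getCounter("REDUCE_OUTPUT_RECORDS");
 }
 System.out.println("input records: " + inputRecords
         + ", output records: " + outputRecords);
 {code}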

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-14 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: Open)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch


 I propose the following changes to Pig's optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  Rules will then only match exact 
 patterns instead of allowing missing elements in the pattern.  The downside 
 is that if a given rule applies to two patterns (say Load->Filter->Group and 
 Load->Group) you have to write two rules, but the upside is that the 
 resulting rules know exactly what they are getting.  The original intent of 
 the required array was to reduce the number of rules that needed to be 
 written, but the resulting rules then have to do a lot of work to understand 
 the operators they are working with.  With exact matches only, each rule 
 knows exactly which operators it is working on and can apply the logic of 
 shifting them around.  All four of the existing rules set all entries of 
 required to true, so removing it will have no effect on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 more conversions or a maximum number of iterations has been reached.  
 Currently the function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             if (matcher.match(rule)) {
                 List<List<O>> matches = matcher.getAllMatches();
                 for (List<O> match : matches) {
                     // It matches the pattern.  Now check if the transformer
                     // approves as well.
                     if (rule.transformer.check(match)) {
                         // The transformer approves.
                         sawMatch = true;
                         rule.transformer.transform(match);
                     }
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite loops.  
 The reason for iterating over the rules is so that each rule can be applied 
 as many times as necessary.  This lets us write simple rules, mostly swaps 
 between neighboring operators, without worrying about getting the plan right 
 in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial plan to the final one without any rule needing to understand the big 
 picture of the entire plan.
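 A self-contained toy of this fixed-point driver (illustrative only, not Pig 
 code; note that reaching the stated final plan also needs an implicit 
 foreach/join swap):
 {code}
 import java.util.*;
 
 public class SwapDemo {
     // Adjacent pairs that may be swapped: filter past join, foreach past
     // filter, and foreach past join (the swap the example leaves implicit).
     static final Set<String> SWAPS = new HashSet<String>(Arrays.asList(
             "Join/Filter", "Filter/Foreach", "Join/Foreach"));
 
     public static void main(String[] args) {
         List<String> plan = new ArrayList<String>(
                 Arrays.asList("Load", "Join", "Filter", "Foreach"));
         boolean sawMatch;
         int numIterations = 0;
         do {
             sawMatch = false;
             for (int i = 0; i + 1 < plan.size(); i++) {
                 if (SWAPS.contains(plan.get(i) + "/" + plan.get(i + 1))) {
                     Collections.swap(plan, i, i + 1);
                     sawMatch = true;
                 }
             }
         } while (sawMatch && numIterations++ < 1000);
         System.out.println(plan);   // [Load, Foreach, Filter, Join]
     }
 }
 {code}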
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first operator, assumed to have multiple inputs.
  * @param second operator, will be pushed in front of first
  * @param inputNum, indicates which input of the first operator 

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-14 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: OptimizerPhase2.patch

Attaching a new patch for Optimizer Phase 2. The previous patch did not include 
a newly added file.


[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-14 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: (was: OptimizerPhase2.patch)


[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-14 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)


[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

Fixing the findbugs warning.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-777:
---

Status: Patch Available  (was: Open)

 Code refactoring: Create optimization out of store/load post processing code
 

 Key: PIG-777
 URL: https://issues.apache.org/jira/browse/PIG-777
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
 Attachments: log_message.patch


 The postProcessing method in the Pig server checks whether a logical graph 
 contains stores to and loads from the same location. If so, it will either 
 connect the store and load, or optimize by throwing out the load and 
 connecting the store's predecessor with the successor of the load.
 Ideally the introduction of the store-load connection should happen in 
 the query compiler, while the optimization should happen in a separate 
 optimizer step as part of the optimizer framework.
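 A hedged sketch of what that separate optimizer step might look like 
 (operator and accessor names are illustrative, not the actual Pig classes):
 {code}
 // If a store writes the same location that a later load reads, drop the
 // load and connect the store's predecessor directly to the load's successor.
 if (store.getOutputFile().equals(load.getInputFile())) {  // hypothetical accessors
     LogicalOperator pred = plan.getPredecessors(store).get(0);
     LogicalOperator succ = plan.getSuccessors(load).get(0);
     plan.remove(load);          // throw out the load
     plan.connect(pred, succ);   // bypass the store->load hop
 }
 {code}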

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-810) Scripts failing with NPE

2009-05-14 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-810:
---

Attachment: PIG-810.patch

 Scripts failing with NPE
 

 Key: PIG-810
 URL: https://issues.apache.org/jira/browse/PIG-810
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: PIG-810.patch


 Scripts such as:
 {code}
 a = load 'nosuchfile';
 b = store a into 'bla';
 {code}
 are failing with
 {code}
 ERROR 2043: Unexpected error during execution.
 org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected 
 error during execution.
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:275)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:757)
 at org.apache.pig.PigServer.execute(PigServer.java:750)
 at org.apache.pig.PigServer.access$100(PigServer.java:88)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:917)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:242)
 at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:110)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:151)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:123)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:372)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.pig.tools.pigstats.PigStats.accumulateMRStats(PigStats.java:175)
 at 
 org.apache.pig.tools.pigstats.PigStats.accumulateStats(PigStats.java:94)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:148)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:262)
 ... 10 more
 {code}
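 The NPE originates in PigStats.accumulateMRStats when a failed job leaves no 
 stats to read. A hypothetical illustration of the kind of guard needed (the 
 accessor names are invented; the attached patch may differ):
 {code}
 // Skip stats accumulation for jobs that never produced counters
 // (e.g. the load failed before the job ran).
 for (Job job : jobs) {
     RunningJob rj = job.getRunningJob();   // hypothetical accessor
     if (rj == null || rj.getCounters() == null) {
         continue;                          // nothing to accumulate
     }
     // ... accumulate map/reduce counters as before ...
 }
 {code}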

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-810) Scripts failing with NPE

2009-05-14 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-810:
---

Fix Version/s: 0.3.0
   Status: Patch Available  (was: Open)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-05-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709590#action_12709590
 ] 

Olga Natkovich commented on PIG-777:


Looks like the patch does more than just the log message. I think you need to 
add a unit test showing why this line is needed:

store.setInputSpec(load.getInputFile());


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-14 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709608#action_12709608
 ] 

Doug Cutting commented on PIG-794:
--

Looking at the patch, I have a few questions and remarks:
 - Why not name the records Tuple and Bag instead of T and B?  The names 
are not written in the data, so there's little advantage to shorter names.
 - Why not construct the schema using the Java Schema API instead of parsing 
it from JSON?  Then you would not need to walk the schema afterwards to 
find union indexes, and you'd get compile-time API checking rather than 
potential load-time JSON parse errors.
 - Why not extend GenericDatumReader and override newRecord() to create either 
a Bag or a Tuple, then override addField() to add values to either a bag or 
tuple?  This would make the patch much smaller, and potentially permit you to 
eventually take advantage of GenericDatumReader features like projection and 
object reuse.
 - Finally, since you're using a pre-release version of Avro, you should 
probably name the jar with the subversion revision number.  Also note that, 
since Avro is not yet stable, it should not yet be used for persistent data in 
production systems.
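
A hedged sketch of the programmatic-schema suggestion (method names are taken 
from the Avro Java Schema API as later published; the 2009 pre-release API may 
differ):

{code}
import java.util.Arrays;
import org.apache.avro.Schema;

// Build the field-value union in code rather than parsing JSON; each
// branch's union index is then available at construction time, with
// compile-time checking instead of load-time JSON parse errors.
Schema valueUnion = Schema.createUnion(Arrays.asList(
        Schema.create(Schema.Type.NULL),
        Schema.create(Schema.Type.LONG),
        Schema.create(Schema.Type.DOUBLE),
        Schema.create(Schema.Type.STRING)));
int longIndex = valueUnion.getIndexNamed("long");   // no post-parse walk needed
{code}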


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A proposal for changing pig's memory management

2009-05-14 Thread Ted Dunning
That Telegraph dataflow paper is pretty long in the tooth.  Certainly
several of their claims have little force any more (lack of non-blocking
I/O, poor thread performance, no unmap, very expensive synchronization for
uncontested locks).  It is worth noting that they did all of their tests on
the 1.3 JVM, and things have come an enormous way since then.

Certainly it is worth having opaque containers based on byte arrays, but
isn't that pretty much what the NIO byte buffers are there to provide?
Wouldn't a virtual tuple type that was nothing more than a byte buffer, a
type, and an offset do almost all of what is proposed here?
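
A hedged sketch of such a virtual tuple (layout and type tags invented for 
illustration):

{code}
import java.nio.ByteBuffer;

// A tuple as a view -- shared buffer + offset + type tag -- so the heap
// holds one large byte array instead of many small objects.
final class ByteBufferTuple {
    private final ByteBuffer data;   // backing store shared by many tuples
    private final int offset;        // where this tuple's bytes start
    private final byte typeTag;     // e.g. 0 = long fields, 1 = double fields

    ByteBufferTuple(ByteBuffer data, int offset, byte typeTag) {
        this.data = data;
        this.offset = offset;
        this.typeTag = typeTag;
    }

    long getLongField(int field) {
        // absolute get: nothing is deserialized or copied
        return data.getLong(offset + field * 8);
    }
}
{code}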

On Thu, May 14, 2009 at 5:33 PM, Alan Gates ga...@yahoo-inc.com wrote:

 http://wiki.apache.org/pig/PigMemory

 Alan.



[jira] Updated: (PIG-619) Dumping empty results produces Unable to get results for /tmp/temp-1964806069/tmp256878619 org.apache.pig.builtin.BinStorage message

2009-05-14 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-619:
---

Fix Version/s: 0.3.0
   Status: Patch Available  (was: Open)

In order to see this behavior, you need three map reduce jobs, something like:

A = load
B = filter everything out
C = group
D = foreach
E = distinct
F = group
G = foreach
store G

In this case the first job (A-D) will run and produce zero-length part files.  
The second job (E) will run, but no maps will be started because the files are 
zero length.  As a result, Hadoop now seems to create no output files for this 
second job.  The third job (F-G) then fails, complaining that its input files 
don't exist.  The patch changes Pig's slicer to return at least one input 
split per part file, even when the file is zero length.
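
A hedged sketch of that slicer change (SliceImpl and the surrounding names are 
illustrative, not the actual patch):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;

// Always emit at least one split per part file, so the downstream job sees
// an (empty) input rather than a missing one.
List<SliceImpl> slices = new ArrayList<SliceImpl>();
for (FileStatus part : fs.listStatus(inputDir)) {
    long len = part.getLen();
    if (len == 0) {
        slices.add(new SliceImpl(part.getPath(), 0, 0));   // zero-length split
    } else {
        for (long start = 0; start < len; start += blockSize) {
            slices.add(new SliceImpl(part.getPath(), start,
                                     Math.min(blockSize, len - start)));
        }
    }
}
{code}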

 Dumping empty results produces Unable to get results for 
 /tmp/temp-1964806069/tmp256878619  org.apache.pig.builtin.BinStorage message
 ---

 Key: PIG-619
 URL: https://issues.apache.org/jira/browse/PIG-619
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop 18, Multi-node hadoop installation
Reporter: Viraj Bhat
Assignee: Alan Gates
 Fix For: 0.3.0

 Attachments: mydata.txt, PIG-619.patch, tmpfileload.pig


 The following Pig script stores empty filter results into the 
 'emptyfilteredlogs' HDFS dir. It later reloads this data from the empty HDFS 
 dir for additional grouping and counting. It has been observed that this 
 script succeeds on a single-node Hadoop installation with the following 
 message, as the alias COUNT_EMPTYFILTERED_LOGS contains empty data.
 ==
 2009-01-13 21:47:08,988 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Success!
 ==
 But on a multi-node Hadoop installation, the script fails with the following 
 error:
 ==
 2009-01-13 13:48:34,602 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Success!
 java.io.IOException: Unable to open iterator for alias: 
 COUNT_EMPTYFILTERED_LOGS [Unable to get results for 
 /tmp/temp-1964806069/tmp256878619:org.apache.pig.builtin.BinStorage]
 at 
 org.apache.pig.backend.hadoop.executionengine.HJob.getResults(HJob.java:74)
 at org.apache.pig.PigServer.openIterator(PigServer.java:408)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
 Caused by: org.apache.pig.backend.executionengine.ExecException: Unable to 
 get results for 
 /tmp/temp-1964806069/tmp256878619:org.apache.pig.builtin.BinStorage
 ... 7 more
 Caused by: java.io.IOException: /tmp/temp-1964806069/tmp256878619 does not 
 exist
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:188)
 at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:291)
 at 
 org.apache.pig.backend.hadoop.executionengine.HJob.getResults(HJob.java:69)
 ... 6 more
 ==
 {code}
 RAW_LOGS = load 'mydata.txt' as (url:chararray, numvisits:int);
 RAW_LOGS = limit RAW_LOGS 2;
 FILTERED_LOGS = filter RAW_LOGS by numvisits < 0;
 store FILTERED_LOGS into 'emptyfilteredlogs' using PigStorage();
 EMPTY_FILTERED_LOGS = load 'emptyfilteredlogs' as (url:chararray, 
 numvisits:int);
 GROUP_EMPTYFILTERED_LOGS = group EMPTY_FILTERED_LOGS by numvisits;
 COUNT_EMPTYFILTERED_LOGS = foreach GROUP_EMPTYFILTERED_LOGS generate
  group, COUNT(EMPTY_FILTERED_LOGS);
 explain COUNT_EMPTYFILTERED_LOGS;
 dump COUNT_EMPTYFILTERED_LOGS;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.