Re: Our release process

2012-12-12 Thread Julien Le Dem
I think we all agree here, let's not jump to conclusions.
Everything in this branch I am talking about is in Apache Pig. Everything
we do in Pig is contributed.
We have a branch for 0.11 where we keep merging the official 0.11 branch
plus a few patches (and it will stay small) that are only in Apache TRUNK.
The goal here is to help keep the release branch stable by not adding
patches that are only useful to us.
Having this branch allows us to fix anything quickly and redeploy to
production. It is also what allows us to use the pig 0.11 branch in
production before it is even released.
This definitely benefits the community and helps make 0.11 stable.
This is a very reasonable way to keep using a recent version of Pig in
production.

Olga: My goal is to decrease the scope of what is going into the release
branch and to make sure we add only bug fixes that do not make it
unstable. I also think having a short definition of this helps, which is why
I have been chiming in.
Let us know how you want to decrease the scope. I'm just trying to simplify
here.

Julien



On Tue, Dec 11, 2012 at 8:54 AM, Prashant Kommireddi prash1...@gmail.com wrote:

 Share the same concern as Russell here. Not great for the project for
 everyone to go private branch approach.

 On Tue, Dec 11, 2012 at 8:33 AM, Russell Jurney russell.jur...@gmail.com wrote:

  Wait. Ack. Do we want everyone to do this? This sounds like
  fragmentation. :(
 
  Russell Jurney twitter.com/rjurney
 
 
  On Dec 10, 2012, at 3:24 PM, Olga Natkovich onatkov...@yahoo.com wrote:
 
   If everybody is using a private branch then
  
   (1) We are not serving a significant part of our community
   (2) There is no motivation to contribute those patches to branches
   (only to trunk).
  
   Yahoo has been trying hard to work off the Apache branches but if we
   increase the scope of what is going into branches, we will go with
   the private branch approach as well.
  
   Olga
  
  
   
   From: Julien Le Dem jul...@twitter.com
   To: Olga Natkovich onatkov...@yahoo.com
   Cc: dev@pig.apache.org dev@pig.apache.org; Santhosh M S
   santhosh_mut...@yahoo.com; billgra...@gmail.com billgra...@gmail.com
   Sent: Friday, December 7, 2012 3:54 PM
   Subject: Re: Our release process
  
   Here are my criteria for inclusion in a release branch:
   - no new feature. Only bug fixes.
   - The criteria are more about stability than priority. The person/group
   asking for it has a good reason for wanting it in the branch. If
   committers think the patch is reasonable and won't make the branch
   unstable then we should check it in. If it breaks something anyway, we
   revert it.
  
   For what it's worth we (at Twitter) maintain an internal branch where
   we add patches we need and I would suggest that anybody who wants to be
   able to make emergency fixes to their own deployment do the same. We do
   keep that branch as close to apache as we can but it has a few patches
   that are in trunk only and do not satisfy the no new feature criteria.
  
   What does the PMC think?
  
   Julien
  
  
  
  
   On Tue, Dec 4, 2012 at 12:46 PM, Olga Natkovich onatkov...@yahoo.com wrote:
  
   I am ok with tests running nightly and reverting patches that cause
   failures. We used to have that. Does anybody know what happened? Is
   anybody volunteering to make it work again?
  
   I would like to see specific criteria for what goes into the branch
   be published (rather than case-by-case). This way each team can decide
   if the criteria are stringent enough or if they need to run a private
   branch.
  
   Olga
  
  --
   *From:* Santhosh M S santhosh_mut...@yahoo.com
   *To:* Julien Le Dem jul...@twitter.com; dev@pig.apache.org 
   dev@pig.apache.org
   *Cc:* billgra...@gmail.com billgra...@gmail.com
   *Sent:* Friday, November 30, 2012 11:46 PM
  
   *Subject:* Re: Our release process
  
   Hi Julien,
  
   You are making most of the points that I did on this thread (CI for
   e2e, not burdening clean e2e prior to every commit for a release branch).
   The only point on which there is no clear agreement is the definition of
   a bug that can be included in a previously released branch. I am fine
   with a case by case inclusion.
  
   Hi Olga,
  
   Are you fine with Julien's proposal as it stands - bugs that are
   included will be determined at the time of inclusion instead of doing it
   now?
  
   Santhosh
  
  
   
   From: Julien Le Dem jul...@twitter.com
   To: dev@pig.apache.org; Santhosh M S santhosh_mut...@yahoo.com
   Cc: billgra...@gmail.com billgra...@gmail.com
   Sent: Friday, November 30, 2012 5:37 PM
   Subject: Re: Our release process
  
   Proposed criteria:
   - it makes the tests fail. targets test-commit + test + e2e tests
   - a critical bug is reported in a short time frame (definition of
   critical not needed as it is rare and can be decided on a case 

Re: Our release process

2012-12-12 Thread Olga Natkovich
Hi Julien,

I understand what you are trying to do and I can see that being able to make
more fixes post release has value for some use cases. My concern is that
what counts as not destabilizing the branch is fairly subjective and also not
always easy to ascertain beyond trivial changes. The only way I know to keep
code stable is to limit the updates. Also, we need to clearly state what the
constraints are for post-release commits so that every user can decide whether
it works for them.

Olga



From: Julien Le Dem jul...@twitter.com
To: dev@pig.apache.org dev@pig.apache.org 
Sent: Wednesday, December 12, 2012 10:26 AM
Subject: Re: Our release process

I think we all agree here, let's not jump to conclusions.
Everything in this branch I am talking about is in Apache Pig. Everything
we do in Pig is contributed.
We have a branch for 0.11 where we keep merging the official 0.11 branch
plus a few patches (and it will stay small) that are only in Apache TRUNK.
The goal here is to help keep the release branch stable by not adding
patches that are only useful to us.
Having this branch allows us to fix anything quickly and redeploy to
production. It is also what allows us to use the pig 0.11 branch in
production before it is even released.
This definitely benefits the community and helps make 0.11 stable.
This is a very reasonable way to keep using a recent version of Pig in
production.

Olga: My goal is to decrease the scope of what is going into the release
branch and to make sure we add only bug fixes that do not make it
unstable. I also think having a short definition of this helps, which is why
I have been chiming in.
Let us know how you want to decrease the scope. I'm just trying to simplify
here.

Julien




[jira] [Commented] (PIG-3020) Duplicate uid in schema error when joining two relations derived from the same load statement

2012-12-12 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530255#comment-13530255
 ] 

Julien Le Dem commented on PIG-3020:


[~dvryaboy] I just noticed it was logging a warning with a NullPointerException 
when running tests from eclipse. I just fixed the log line to something 
clearer. It is not related but I feel it is small enough to be done here.
[~jcoveney] I also added a unit test with a pig script that was failing before 
and works now to validate my change.

 Duplicate uid in schema error when joining two relations derived from the 
 same load statement
 ---

 Key: PIG-3020
 URL: https://issues.apache.org/jira/browse/PIG-3020
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Julien Le Dem
 Attachments: PIG-3020.patch


 The following validates OK with pig 0.9 and fails with the following error
 in 0.11 (and I suspect 0.10)
 pig -c debug2.pig
 Script: debug2.pig
 {noformat}
 A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{},
     uids_with_flock:bag{});
 edges_both = FILTER A BY NOT IsEmpty(uids_with_recs)
     AND NOT IsEmpty(uids_with_flock);
 edges_both = FOREACH edges_both GENERATE
     group.uid AS src_id,
     group.dst_id AS dst_id;
 both_counts = GROUP edges_both BY src_id;
 both_counts = FOREACH both_counts GENERATE
     group AS src_id, SIZE(edges_both) AS size_both;
 edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
 edges_bq = FOREACH edges_bq GENERATE
     group.uid AS src_id,
     group.dst_id AS dst_id;
 bq_counts = GROUP edges_bq BY src_id;
 bq_counts = FOREACH bq_counts GENERATE
     group AS src_id, SIZE(edges_bq) AS size_bq;
 per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
 store per_user_set_sizes into 'foo';
 {noformat}
 Error:
 {noformat}
 ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias null
   at org.apache.pig.PigServer.explain(PigServer.java:999)
   at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:398)
   at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:330)
   at org.apache.pig.tools.grunt.Grunt.checkScript(Grunt.java:98)
   at org.apache.pig.Main.run(Main.java:600)
   at org.apache.pig.Main.main(Main.java:154)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
   at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:277)
   at org.apache.pig.PigServer.compilePp(PigServer.java:1322)
   at org.apache.pig.PigServer.explain(PigServer.java:984)
   ... 10 more
 Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:232)
   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
   at org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:171)
   at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
   ... 13 more
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (PIG-2802) Wrong Schema generated when there is a dangling alias

2012-12-12 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi resolved PIG-2802.
---

Resolution: Duplicate

 Wrong Schema generated when there is a dangling alias
 -

 Key: PIG-2802
 URL: https://issues.apache.org/jira/browse/PIG-2802
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.2, 0.10.0
Reporter: Anitha Raju

 Hi,
 Script
 {code}
 A = load 'test.txt' using PigStorage() AS (x:int,y:int, z:int) ;
 B = GROUP A BY x;
 C = foreach B generate A.x as s;
 describe C; -- C: {s: {(x: int)}}
 D = FOREACH B {
     E = ORDER A by y;
     GENERATE A.x as s;
 };
 describe D; -- D: {x: int,y: int,z: int}
 {code}
 Here E is a dangling alias. 
 Regards,
 Anitha



[jira] [Created] (PIG-3093) Self join + realias results in schema errors

2012-12-12 Thread Jonathan Coveney (JIRA)
Jonathan Coveney created PIG-3093:
-

 Summary: Self join + realias results in schema errors
 Key: PIG-3093
 URL: https://issues.apache.org/jira/browse/PIG-3093
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11, 0.12
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
Priority: Critical
 Fix For: 0.12


So this one took a while to isolate, but is pretty crazy.

{code}
A = load 'a' as (field1:chararray);
B = foreach A generate *;
C = join A by field1, B by field1;
D = foreach C generate A::field1 as field2, B::field1;
describe D;
/*
D: {
field2: chararray,
B::field1: chararray
}
*/
E = foreach D generate field2, field1;
describe E;
/*
E: {
B::field1: chararray,
B::field1: chararray
}
*/
F = foreach E generate field2;
store F into 'fail';
-- file cristian_simpler.pig, line 20, column 4 Invalid field projection.
-- Projected field [field2] does not exist in schema:
-- B::field1:chararray,B::field1:chararray.
{code}

If you take a look at that code snippet, that is pretty nuts! Since the 2 
fields come from the same original table, renaming one causes issues with both. 
WUT. The even weirder part is not that they both get renamed, but that they 
both become the unrenamed value.

Interestingly, flipping the value of the projection changes the order of the 
output, so it looks like it's whatever the final reference is. ie

{code}
A = load 'a' as (field1:chararray);
B = foreach A generate *;
C = join A by field1, B by field1;
D = foreach C generate B::field1, A::field1 as field2;
describe D;
E = foreach D generate field2, field1;
describe E;
F = foreach E generate field2;
store F into 'fail';
{code}

results in
{code}

D: {
B::field1: chararray,
field2: chararray
}
E: {
field2: chararray,
field2: chararray
}
2012-12-13 00:13:10,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
file simplest.pig, line 8, column 23 Invalid field projection. Projected field [field2] does not exist in schema: field2:chararray,field2:chararray.
{code}

This seems to imply the solution: make copies of the Schema. I added a test and 
will hopefully have a patch soon.
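
To illustrate why copying the schema might help, here is a minimal Java sketch of the aliasing problem. It does not use Pig's real classes: `SketchField`, `SchemaCopySketch`, and `joinAndRename` are hypothetical names made up for this example, and it assumes (as the description suggests but does not confirm) that both sides of the self join end up holding the same mutable field object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field schema; NOT Pig's real class.
class SketchField {
    String alias;
    final String type;

    SketchField(String alias, String type) {
        this.alias = alias;
        this.type = type;
    }

    // Returns an independent copy so later renames do not leak to other holders.
    SketchField copy() {
        return new SketchField(alias, type);
    }
}

class SchemaCopySketch {
    // Builds a two-column "self join" schema from one source field, then
    // renames the first column. With copyFields == false both columns are
    // the same object, so the rename is visible through both of them.
    static List<String> joinAndRename(boolean copyFields) {
        SketchField source = new SketchField("field1", "chararray");
        List<SketchField> joined = new ArrayList<>();
        joined.add(copyFields ? source.copy() : source);
        joined.add(copyFields ? source.copy() : source);

        // Corresponds loosely to "A::field1 as field2" in the script above.
        joined.get(0).alias = "field2";

        List<String> aliases = new ArrayList<>();
        for (SketchField f : joined) {
            aliases.add(f.alias);
        }
        return aliases;
    }

    public static void main(String[] args) {
        System.out.println("shared: " + joinAndRename(false));
        System.out.println("copied: " + joinAndRename(true));
    }
}
```

With shared objects the rename shows up in both columns (`[field2, field2]`); with copies only the renamed column changes (`[field2, field1]`). This only mirrors the second symptom reported above; the real fix in Pig may look quite different.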



[jira] [Commented] (PIG-3093) Self join + realias results in schema errors

2012-12-12 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530527#comment-13530527
 ] 

Jonathan Coveney commented on PIG-3093:
---

One thing also to look into (once the initial patch is done) is to make sure 
that the data is correct.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-12 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530543#comment-13530543
 ] 

Cheolsoo Park commented on PIG-3015:


Hi Joe,

Can you please add the missing files to the patch?
{code}
Exception in thread "main" java.io.FileNotFoundException: data/json/recordsWithDoubleUnderscores.json (No such file or directory)
Exception in thread "main" java.io.FileNotFoundException: data/json/arrays.json (No such file or directory)
Exception in thread "main" java.io.FileNotFoundException: data/json/arraysAsOutputByPig.json (No such file or directory)
{code}
I can't run your test cases.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



Re: Our release process

2012-12-12 Thread Julien Le Dem
Agreed. The priority of a change is subjective as well.
My definition for inclusion on the release branch:
- Only bug fixes.
- Only if they have fairly well understood repercussions (up to the committers
who +/-1 as usual).
- If we thought it would not break things but still does (CI or externally
reported failure) we revert it.
What do you want to add/change? Please reformulate those rules the way you
like and let's see how we can converge.
(Also, let's keep it short for clarity)

Julien

On Wed, Dec 12, 2012 at 11:08 AM, Olga Natkovich onatkov...@yahoo.com wrote:

 Hi Julien,

 I understand what you are trying to do and I can see that being able to
 make more fixes post release has value for some use cases. My concern is
 that what counts as not destabilizing the branch is fairly subjective and
 also not always easy to ascertain beyond trivial changes. The only way I
 know to keep code stable is to limit the updates. Also, we need to clearly
 state what the constraints are for post-release commits so that every user
 can decide whether it works for them.

 Olga


 

[jira] Subscription: PIG patch available

2012-12-12 Thread jira
Issue Subscription
Filter: PIG patch available (37 issues)

Subscriber: pigdaily

Key       Summary
PIG-3088  Add a builtin udf which removes prefixes
          https://issues.apache.org/jira/browse/PIG-3088
PIG-3086  Allow A Prefix To Be Added To URIs In PigUnit Tests
          https://issues.apache.org/jira/browse/PIG-3086
PIG-3085  Errors and lacks in document Built In Functions
          https://issues.apache.org/jira/browse/PIG-3085
PIG-3078  Make a UDF that, given a string, returns just the columns prefixed by that string
          https://issues.apache.org/jira/browse/PIG-3078
PIG-3073  POUserFunc creating log spam for large scripts
          https://issues.apache.org/jira/browse/PIG-3073
PIG-3069  Native Windows Compatibility for Pig E2E Tests and Harness
          https://issues.apache.org/jira/browse/PIG-3069
PIG-3067  HBaseStorage should be split up to become more managable
          https://issues.apache.org/jira/browse/PIG-3067
PIG-3066  Fix TestPigRunner in trunk
          https://issues.apache.org/jira/browse/PIG-3066
PIG-3057  make readField protected to be able to override it if we extend PigStorage
          https://issues.apache.org/jira/browse/PIG-3057
PIG-3051  java.lang.IndexOutOfBoundsException  failure with LimitOptimizer + ColumnPruning
          https://issues.apache.org/jira/browse/PIG-3051
PIG-3029  TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
          https://issues.apache.org/jira/browse/PIG-3029
PIG-3028  testGrunt dev test needs some command filters to run correctly without cygwin
          https://issues.apache.org/jira/browse/PIG-3028
PIG-3027  pigTest unit test needs a newline filter for comparisons of golden multi-line
          https://issues.apache.org/jira/browse/PIG-3027
PIG-3026  Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
          https://issues.apache.org/jira/browse/PIG-3026
PIG-3025  TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
          https://issues.apache.org/jira/browse/PIG-3025
PIG-3024  TestEmptyInputDir unit test - hadoop version detection logic is brittle
          https://issues.apache.org/jira/browse/PIG-3024
PIG-3015  Rewrite of AvroStorage
          https://issues.apache.org/jira/browse/PIG-3015
PIG-3010  Allow UDF's to flatten themselves
          https://issues.apache.org/jira/browse/PIG-3010
PIG-2959  Add a pig.cmd for Pig to run under Windows
          https://issues.apache.org/jira/browse/PIG-2959
PIG-2957  TetsScriptUDF fail due to volume prefix in jar
          https://issues.apache.org/jira/browse/PIG-2957
PIG-2956  Invalid cache specification for some streaming statement
          https://issues.apache.org/jira/browse/PIG-2956
PIG-2955  Fix bunch of Pig e2e tests on Windows
          https://issues.apache.org/jira/browse/PIG-2955
PIG-2878  Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
          https://issues.apache.org/jira/browse/PIG-2878
PIG-2873  Converting bin/pig shell script to python
          https://issues.apache.org/jira/browse/PIG-2873
PIG-2834  MultiStorage requires unused constructor argument
          https://issues.apache.org/jira/browse/PIG-2834
PIG-2824  Pushing checking number of fields into LoadFunc
          https://issues.apache.org/jira/browse/PIG-2824
PIG-2661  Pig uses an extra job for loading data in Pigmix L9
          https://issues.apache.org/jira/browse/PIG-2661
PIG-2645  PigSplit does not handle the case where SerializationFactory returns null
          https://issues.apache.org/jira/browse/PIG-2645
PIG-2614  AvroStorage crashes on LOADING a single bad error
          https://issues.apache.org/jira/browse/PIG-2614
PIG-2507  Semicolon in paramenters for UDF results in parsing error
          https://issues.apache.org/jira/browse/PIG-2507
PIG-2433  Jython import module not working if module path is in classpath
          https://issues.apache.org/jira/browse/PIG-2433
PIG-2417  Streaming UDFs -  allow users to easily write UDFs in scripting languages with no JVM implementation.
          https://issues.apache.org/jira/browse/PIG-2417
PIG-2362  Rework Ant build.xml to use macrodef instead of antcall
          https://issues.apache.org/jira/browse/PIG-2362
PIG-2341  Need better documentation on Pig/HBase integration
          https://issues.apache.org/jira/browse/PIG-2341
PIG-2312  NPE when relation and column share the same name and used in Nested Foreach
          https://issues.apache.org/jira/browse/PIG-2312
PIG-1942  script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

[jira] [Commented] (PIG-2553) Pig shouldn't allow attempts to write multiple relations into same directory

2012-12-12 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530588#comment-13530588
 ] 

Cheolsoo Park commented on PIG-2553:


Sorry for the delay. Here are some comments. Please let me know what you think.
- Wouldn't it make more sense to make it public since this property is a public 
property, and this variable may be reused somewhere else in the future? Do you 
agree?
{code}
private static final String PIG_LOCATION_CHECK_STRICT = "pig.location.check.strict";
{code}
- Can you check whether {{PIG_LOCATION_CHECK_STRICT}} is enabled before calling 
{{getStoreLocIfInvalid(storeOps)}} since then we can avoid calling it when 
unnecessary?
{code}
LOStore invalidStore = getStoreLocIfInvalid(storeOps);
if (invalidStore != null
        && "true".equals(pigContext.getProperties().getProperty(PIG_LOCATION_CHECK_STRICT))) {
    throw new RuntimeException("Script contains 2 or more STORE statements writing to same location : "
            + invalidStore.getFileSpec().getFileName());
}
{code}
- Wouldn't it make more sense for {{getStoreLocIfInvalid()}} to return the 
filename as {{String}} instead of {{LOStore}}? {{LOStore}} seems unnecessary to 
me.
- I am not sure that creating the {{admin}} section in the docs makes sense. Even 
if an admin sets this property, users can always override it by running Pig with 
{{-Dpig.location.check.strict=false}}. So I don't think that this property is 
different from any other user property. Can we document it in 
{{conf/pig.properties}} like we did for other properties? Do you agree?
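
For illustration, the suggestions about checking the property first and returning the filename as a {{String}} could look roughly like the sketch below. It is self-contained and does not use Pig's classes: {{StoreLocationCheck}}, {{findDuplicateLocation}} and {{validate}} are hypothetical names, store locations are plain strings rather than {{LOStore}} operators, and only the property name and the error message come from the patch under review.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in for the planner-side check; this is not Pig's actual code.
final class StoreLocationCheck {
    // Public, since the property itself is user-visible.
    public static final String PIG_LOCATION_CHECK_STRICT = "pig.location.check.strict";

    // Returns the first location that appears more than once, or null if all are distinct.
    static String findDuplicateLocation(List<String> storeLocations) {
        Set<String> seen = new HashSet<>();
        for (String location : storeLocations) {
            if (!seen.add(location)) {
                return location;
            }
        }
        return null;
    }

    // Consults the property first, so the scan is skipped when strict checking is off.
    static void validate(Properties props, List<String> storeLocations) {
        if (!"true".equals(props.getProperty(PIG_LOCATION_CHECK_STRICT))) {
            return;
        }
        String duplicate = findDuplicateLocation(storeLocations);
        if (duplicate != null) {
            throw new RuntimeException(
                "Script contains 2 or more STORE statements writing to same location : " + duplicate);
        }
    }
}
```

With this shape, a script run with {{-Dpig.location.check.strict=false}} (or with the property unset) never pays for the duplicate scan.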

 Pig shouldn't allow attempts to write multiple relations into same directory
 

 Key: PIG-2553
 URL: https://issues.apache.org/jira/browse/PIG-2553
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Prashant Kommireddi
 Attachments: PIG-2553_1.patch, PIG-2553.patch


 We've seen multiple occasions where users accidentally try to store 2 or more 
 different relations to the same destination directory. Currently, this passes 
 the Pig planner and fails on MR side due to concurrent attempts to create the 
 same part file on the reducer. This is extremely confusing to the user, and 
 hard to debug.
 We should instead fail their scripts before they are even submitted, since we 
 can identify the erroneous condition from the beginning.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3010) Allow UDF's to flatten themselves

2012-12-12 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530601#comment-13530601
 ] 

Dmitriy V. Ryaboy commented on PIG-3010:


Can you regenerate the patch without the whitespace changes? It is currently 285 KB.

 Allow UDF's to flatten themselves
 -

 Key: PIG-3010
 URL: https://issues.apache.org/jira/browse/PIG-3010
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
 Fix For: 0.12

 Attachments: PIG-3010-0.patch, PIG-3010-1.patch, PIG-3010-2.patch


 This is something I thought would be cool for a while, so I sat down and did 
 it because I think there are some useful debugging tools it'd help with.
 The idea is that if you attach an annotation to a UDF, the Tuple or DataBag 
 you output will be flattened. This is quite powerful. A very common pattern 
 is:
 a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c);
 This would let you just do:
 a = foreach data generate MyUdf(thing);
 With the exact same result!



[jira] [Commented] (PIG-2857) Add a -tagPath option to PigStorage

2012-12-12 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530613#comment-13530613
 ] 

Cheolsoo Park commented on PIG-2857:


Overall looks good to me. A few comments:
- Can you update the doc ({{func.xml}}) too? We probably should keep 
{{tagsource}} but mark it as deprecated in the doc.
- Can you update the comment of {{testPigStorageSourceTagSchema()}}? It still 
mentions {{tagsource}}.
{code}
/** 
 * This is for testing source tagging option on PigStorage. When a user
 * specifies '-tagsource' as an option, PigStorage must prepend the input
 * source path to the tuple and INPUT_FILE_NAME to schema.
 * 
 * @throws Exception
 */
{code}
- Can you update the following line in {{testPigStorageSourceTagValue()}}? It 
still mentions {{tagsource}}.
{code}
assertEquals("tagsource value must be part-m-0", inputFileName, storeFileName);
{code}
- Can you add some comment to code that tests {{-tagPath}} in 
{{testPigStorageSourceTagSchema()}} just to be clear?
- Can you remove tabs in the patch?

Thanks!

 Add a -tagPath option to PigStorage
 ---

 Key: PIG-2857
 URL: https://issues.apache.org/jira/browse/PIG-2857
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Prashant Kommireddi
 Attachments: PIG-2857_1.patch, PIG-2857.patch


 We recently added a -tagSource option to PigStorage, which allows us to add 
 filenames from which records come to the returned tuples.
 Often, users want the whole path, not just the source file. I propose we add 
 a -tagPath option to do this.



[jira] [Created] (PIG-3094) ERROR 2229: Couldn't find matching uid -1 in Pig 0.10.0

2012-12-12 Thread Navneet Kapur (JIRA)
Navneet Kapur created PIG-3094:
--

 Summary:  ERROR 2229: Couldn't find matching uid -1 in Pig 0.10.0
 Key: PIG-3094
 URL: https://issues.apache.org/jira/browse/PIG-3094
 Project: Pig
  Issue Type: Bug
Reporter: Navneet Kapur
 Fix For: 0.10.0


I'm getting the error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2229: Couldn't find matching uid 
-1 for project (Name: Project Type: bytearray Uid: 2754 Input: 0 Column: 4)

This seems to have been solved for versions 0.8 and 0.9. 
https://issues.apache.org/jira/browse/PIG-1979

For privacy reasons, I am unable to post the code here. The stack-trace that I 
get is as follows:
Pig Stack Trace
---
ERROR 2229: Couldn't find matching uid -1 for project (Name: Project Type: 
bytearray Uid: 2754 Input: 0 Column: 4)

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:282)
    at org.apache.pig.PigServer.compilePp(PigServer.java:1316)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1253)
    at org.apache.pig.PigServer.execute(PigServer.java:1245)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:555)
    at org.apache.pig.Main.main(Main.java:111)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2229: Couldn't find matching uid -1 for project (Name: Project Type: bytearray Uid: 2754 Input: 0 Column: 4)
    at org.apache.pig.newplan.logical.optimizer.ProjectionPatcher$ProjectionRewriter.visit(ProjectionPatcher.java:91)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:207)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
    at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:136)
    at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:114)
    at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:75)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.optimizer.ProjectionPatcher.transformed(ProjectionPatcher.java:48)
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
    ... 11 more



Further notes:
1. I experimented with removing the FOREACH...GENERATE statement where this 
error seems to be occurring. But then, I get the error message:

ERROR 2270: Logical plan invalid state: duplicate uid in schema

2. When I ran the script with the argument-option `-t ColumnMapKeyPrune`, the 
script did successfully run albeit very slowly.



[jira] [Updated] (PIG-3094) ERROR 2229: Couldn't find matching uid -1 in Pig 0.10.0

2012-12-12 Thread Navneet Kapur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navneet Kapur updated PIG-3094:
---

Description: 
I'm getting the error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2229: Couldn't find matching uid 
-1 for project (Name: Project Type: bytearray Uid: 2754 Input: 0 Column: 4)

This seems to have been solved for versions 0.8 and 0.9. 
(https://issues.apache.org/jira/browse/PIG-1979)

For privacy reasons, I am unable to post the code here. The stack-trace that I 
get is as follows:

Pig Stack Trace
---
ERROR 2229: Couldn't find matching uid -1 for project (Name: Project Type: 
bytearray Uid: 2754 Input: 0 Column: 4)

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:282)
    at org.apache.pig.PigServer.compilePp(PigServer.java:1316)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1253)
    at org.apache.pig.PigServer.execute(PigServer.java:1245)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:555)
    at org.apache.pig.Main.main(Main.java:111)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2229: Couldn't find matching uid -1 for project (Name: Project Type: bytearray Uid: 2754 Input: 0 Column: 4)
    at org.apache.pig.newplan.logical.optimizer.ProjectionPatcher$ProjectionRewriter.visit(ProjectionPatcher.java:91)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:207)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
    at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:136)
    at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:114)
    at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:75)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.optimizer.ProjectionPatcher.transformed(ProjectionPatcher.java:48)
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
    ... 11 more



Further notes:
1. I experimented with removing the FOREACH...GENERATE statement where this 
error seems to be occurring. But then, I get the error message:
   ERROR 2270: Logical plan invalid state: duplicate uid in schema
2. When I ran the script with the argument-option `-t ColumnMapKeyPrune`, the 
script did successfully run albeit very slowly.


[jira] [Updated] (PIG-3095) "which" is called many, many times for each Pig STREAM statement

2012-12-12 Thread Nick White (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick White updated PIG-3095:


Status: Patch Available  (was: Open)

 "which" is called many, many times for each Pig STREAM statement
 

 Key: PIG-3095
 URL: https://issues.apache.org/jira/browse/PIG-3095
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.12
Reporter: Nick White
Assignee: Nick White
  Labels: patch, performance
 Fix For: 0.12

 Attachments: PIG-3095.patch


 STREAM statements are checked by the LogicalPlanBuilder as it comes across 
 them - and these checks include running the system utility "which". However, 
 due to the backtracking parsing mechanism, "which" is called repeatedly with 
 the same arguments (I noticed this while profiling a script with 4 STREAM 
 statements - "which" was run over 230 times!). The attached patch just caches 
 the return value of "which", reducing the overhead of running a system 
 process to a Map lookup.



[jira] [Created] (PIG-3095) "which" is called many, many times for each Pig STREAM statement

2012-12-12 Thread Nick White (JIRA)
Nick White created PIG-3095:
---

 Summary: "which" is called many, many times for each Pig STREAM 
statement
 Key: PIG-3095
 URL: https://issues.apache.org/jira/browse/PIG-3095
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.12
Reporter: Nick White
Assignee: Nick White
 Fix For: 0.12
 Attachments: PIG-3095.patch

STREAM statements are checked by the LogicalPlanBuilder as it comes across them 
- and these checks include running the system utility "which". However, due to 
the backtracking parsing mechanism, "which" is called repeatedly with the same 
arguments (I noticed this while profiling a script with 4 STREAM statements - 
"which" was run over 230 times!). The attached patch just caches the return 
value of "which", reducing the overhead of running a system process to a Map 
lookup.
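
As a rough illustration of the caching idea (this is not the code in PIG-3095.patch), the sketch below memoizes lookups in a Map so that the external process runs at most once per distinct command name. The class name WhichCache, the method names, and the injectable resolver are assumptions made here so the example stays self-contained and testable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper, not Pig's actual implementation.
final class WhichCache {
    private final Map<String, Optional<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, Optional<String>> resolver;

    WhichCache(Function<String, Optional<String>> resolver) {
        this.resolver = resolver;
    }

    // Runs the resolver at most once per distinct command name; misses are cached too.
    Optional<String> lookup(String command) {
        return cache.computeIfAbsent(command, resolver);
    }

    // Example resolver: spawns the system "which" and returns the first line it prints.
    static Optional<String> runWhich(String command) {
        try {
            Process p = new ProcessBuilder("which", command).redirectErrorStream(true).start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String first = r.readLine();
                while (r.readLine() != null) {
                    // drain any remaining output so the child process can exit
                }
                return (p.waitFor() == 0 && first != null) ? Optional.of(first.trim()) : Optional.empty();
            }
        } catch (IOException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
```

A caller would construct it once, for example new WhichCache(WhichCache::runWhich), and share it across the repeated parser passes; every lookup after the first for a given command is then a plain Map read.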



[jira] [Updated] (PIG-3095) "which" is called many, many times for each Pig STREAM statement

2012-12-12 Thread Nick White (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick White updated PIG-3095:


Attachment: PIG-3095.patch

 "which" is called many, many times for each Pig STREAM statement
 

 Key: PIG-3095
 URL: https://issues.apache.org/jira/browse/PIG-3095
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.12
Reporter: Nick White
Assignee: Nick White
  Labels: patch, performance
 Fix For: 0.12

 Attachments: PIG-3095.patch


 STREAM statements are checked by the LogicalPlanBuilder as it comes across 
 them - and these checks include running the system utility "which". However, 
 due to the backtracking parsing mechanism, "which" is called repeatedly with 
 the same arguments (I noticed this while profiling a script with 4 STREAM 
 statements - "which" was run over 230 times!). The attached patch just caches 
 the return value of "which", reducing the overhead of running a system 
 process to a Map lookup.
