[jira] Created: (PIG-1152) bincond operator throws parser error

2009-12-14 Thread Ankur (JIRA)
bincond operator throws parser error


 Key: PIG-1152
 URL: https://issues.apache.org/jira/browse/PIG-1152
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur


The bincond operator throws a parser error when the true branch contains a 
constant bag with one tuple holding a single int field with a negative value. 

Here is the script to reproduce the issue:

A = load 'A' as (s: chararray, x: int, y: int);
B = group A by s;
C = foreach B generate group, flatten(((COUNT(A) < 1L) ? {(-1)} : A.x));
dump C;


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790566#action_12790566
 ] 

Hadoop QA commented on PIG-1143:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427980/PIG_1143.patch
  against trunk revision 890553.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/console

This message is automatically generated.

> Poisson Sample Loader should compute the number of samples required only once
> -
>
> Key: PIG-1143
> URL: https://issues.apache.org/jira/browse/PIG-1143
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Sriranjan Manjunath
> Attachments: PIG_1143.patch
>
>
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.




[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1149:
---

Status: Patch Available  (was: Open)

Because the string is parsed several times along the way, three backslashes 
need to precede the escaped quote in PigLatin, which means six backslashes when 
expressing PigLatin as a string in Java. 
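
A small Java sketch of the backslash count described above. It only shows how 
Java's own string-literal unescaping halves the backslashes; the class and 
method names are hypothetical, and the need for three backslashes on the 
PigLatin side is taken from the comment above, not verified by this code.

```java
// Illustrative only: EscapeDemo and its methods are hypothetical names.
final class EscapeDemo {
    // Six backslashes in Java source become three backslash characters
    // at runtime, followed by the single quote.
    static String pigLatinEscapedQuote() {
        return "\\\\\\'";
    }

    // Counts the backslash characters that actually reach PigLatin.
    static long countBackslashes(String s) {
        return s.chars().filter(c -> c == '\\').count();
    }

    public static void main(String[] args) {
        String s = pigLatinEscapedQuote();
        System.out.println(s + " contains " + countBackslashes(s) + " backslashes");
    }
}
```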

> Allow instantiation of SampleLoaders with parametrized LoadFuncs
> 
>
> Key: PIG-1149
> URL: https://issues.apache.org/jira/browse/PIG-1149
> Project: Pig
>  Issue Type: Bug
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
>Priority: Minor
> Fix For: 0.7.0
>
> Attachments: pig_1149.patch
>
>
> Currently, it is not possible to instantiate a SampleLoader with something 
> like PigStorage(':').  We should allow passing parameters to the loaders 
> being sampled.




[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1149:
---

Attachment: pig_1149.patch

> Allow instantiation of SampleLoaders with parametrized LoadFuncs
> 
>
> Key: PIG-1149
> URL: https://issues.apache.org/jira/browse/PIG-1149
> Project: Pig
>  Issue Type: Bug
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
>Priority: Minor
> Fix For: 0.7.0
>
> Attachments: pig_1149.patch
>
>
> Currently, it is not possible to instantiate a SampleLoader with something 
> like PigStorage(':').  We should allow passing parameters to the loaders 
> being sampled.




[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790545#action_12790545
 ] 

Ankit Modi commented on PIG-965:


* NonConstantRegex - I did not think of equals. I added the length check first 
because it detects a change in length faster and, to the best of my knowledge, 
length() is a simple getter. And yes, as you mentioned, equals also checks for 
the same object and instanceof, which is not useful in our case.

* The numbers published above are using dk.brics.automaton.RunAutomaton. Do you 
want me to publish numbers for a larger set of regexes?

I'll create a patch for the rest of the comments.

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Thejas M Nair
>Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'"
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%' 
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
> has not changed. 
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive LIKE clause uses similar optimizations.
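
The second optimization in the description can be sketched roughly as follows. 
This is a hypothetical illustration, not the code in poregex2.patch: it assumes 
the '%' in the description stands for the regex wildcard `.*`, and the class 
name SimpleMatcher and its Kind enum are invented for the example. The fast 
path is taken only when the remaining text is purely letters and digits.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: replace simple prefix/suffix/contains regexes
// with plain string comparisons, falling back to java.util.regex.
final class SimpleMatcher {
    enum Kind { PREFIX, SUFFIX, CONTAINS, REGEX }

    private static final String WILDCARD = ".*";

    final Kind kind;
    private final String literal;  // literal text used by the fast paths
    private final Pattern pattern; // compiled once, used as the fallback

    SimpleMatcher(String regex) {
        this.pattern = Pattern.compile(regex);
        String core = regex;
        boolean lead = core.startsWith(WILDCARD);
        if (lead) {
            core = core.substring(WILDCARD.length());
        }
        boolean trail = core.endsWith(WILDCARD);
        if (trail) {
            core = core.substring(0, core.length() - WILDCARD.length());
        }
        if ((lead || trail) && isPlainLiteral(core)) {
            this.kind = lead && trail ? Kind.CONTAINS : (trail ? Kind.PREFIX : Kind.SUFFIX);
            this.literal = core;
        } else {
            this.kind = Kind.REGEX;
            this.literal = null;
        }
    }

    // Conservative: only letters and digits count as a plain literal.
    private static boolean isPlainLiteral(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) return false;
        }
        return true;
    }

    // '.' does not match line terminators by default, so such inputs
    // are sent to the real regex to keep the results identical.
    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == '\u0085' || c == '\u2028' || c == '\u2029') {
                return true;
            }
        }
        return false;
    }

    boolean matches(String input) {
        if (kind == Kind.REGEX || hasLineTerminator(input)) {
            return pattern.matcher(input).matches();
        }
        switch (kind) {
            case PREFIX: return input.startsWith(literal);
            case SUFFIX: return input.endsWith(literal);
            default: return input.contains(literal);
        }
    }
}
```

For example, `new SimpleMatcher("abc.*")` is classified as PREFIX and answers 
with `startsWith`, while a pattern such as `a+b` stays on the regex path.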




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Open  (was: Patch Available)

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Thejas M Nair
>Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'"
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%' 
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
> has not changed. 
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive LIKE clause uses similar optimizations.




[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790529#action_12790529
 ] 

Jeff Zhang commented on PIG-1110:
-

Response to Richard,

1. If you are worried about the API compatibility of PigStorage(), since it is 
the default LoadFunc of Pig, another option is to provide a separate LoadFunc 
that supports compression, for example a new LoadFunc such as Bz2PigStorage(). 

2. The file name in the Store statement is actually the folder name, not the 
file name; we will get part-0.bz2 under this folder. The part-0.bz2 file is 
the real file consumed by Hadoop. Hadoop checks the file name rather than the 
folder name to determine the compression codec.
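
To illustrate point 2, here is a minimal, self-contained sketch of choosing a 
codec from the file name rather than the folder name. It is not Hadoop's 
implementation (in Hadoop this lookup is done by CompressionCodecFactory); the 
class CodecBySuffix, the codec labels, and the sample paths are assumptions 
made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a suffix-based codec lookup.
final class CodecBySuffix {
    private static final Map<String, String> SUFFIX_TO_CODEC = new LinkedHashMap<>();
    static {
        SUFFIX_TO_CODEC.put(".gz", "gzip");
        SUFFIX_TO_CODEC.put(".bz2", "bzip2");
    }

    // Looks only at the last path component, so a suffix on the
    // enclosing folder has no effect. Returns null when nothing matches.
    static String codecFor(String path) {
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        for (Map.Entry<String, String> e : SUFFIX_TO_CODEC.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("out/part-0.bz2")); // suffix on the file
        System.out.println(codecFor("out.bz2/part-0")); // suffix only on the folder
    }
}
```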



> Handle compressed file formats -- Gz, BZip with the new proposal
> 
>
> Key: PIG-1110
> URL: https://issues.apache.org/jira/browse/PIG-1110
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Richard Ding
>Assignee: Richard Ding
> Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
>
>





[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1144:


Status: Patch Available  (was: Open)

resubmitting to rerun the tests

> set default_parallelism construct does not set the number of reducers 
> correctly
> ---
>
> Key: PIG-1144
> URL: https://issues.apache.org/jira/browse/PIG-1144
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
> Environment: Hadoop 20 cluster with multi-node installation
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.7.0
>
> Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
> PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set 
> construct: "set default_parallel 100". I modified "MRPrinter.java" to 
> print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr)
> mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " 
> Parallelism " + mr.getRequestedParallelism());
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the 
> actual sort, runs as a single-reducer job. This can be corrected by adding 
> the PARALLEL keyword in front of the ORDER BY.
> Attaching the script and the explain output
> Viraj




[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1144:


Status: Open  (was: Patch Available)

> set default_parallelism construct does not set the number of reducers 
> correctly
> ---
>
> Key: PIG-1144
> URL: https://issues.apache.org/jira/browse/PIG-1144
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
> Environment: Hadoop 20 cluster with multi-node installation
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.7.0
>
> Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
> PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set 
> construct: "set default_parallel 100". I modified "MRPrinter.java" to 
> print out the parallelism:
> {code}
> ...
> public void visitMROp(MapReduceOper mr)
> mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " 
> Parallelism " + mr.getRequestedParallelism());
> ...
> {code}
> When I run an explain on the script, I see that the last job, which does the 
> actual sort, runs as a single-reducer job. This can be corrected by adding 
> the PARALLEL keyword in front of the ORDER BY.
> Attaching the script and the explain output
> Viraj




[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-973:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed. thanks, Richard

> type resolution inconsistency
> -
>
> Key: PIG-973
> URL: https://issues.apache.org/jira/browse/PIG-973
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Attachments: PIG-973.patch
>
>
> This script works:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>describe F;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
> dump C;
> This one produces an error:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G);}
> dump C;
> Notice the difference in how MIN is passed the data.




[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-973:
---

Fix Version/s: 0.7.0

> type resolution inconsistency
> -
>
> Key: PIG-973
> URL: https://issues.apache.org/jira/browse/PIG-973
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.7.0
>
> Attachments: PIG-973.patch
>
>
> This script works:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>describe F;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
> dump C;
> This one produces an error:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G);}
> dump C;
> Notice the difference in how MIN is passed the data.




[jira] Updated: (PIG-1075) Error in Cogroup when key fields types don't match

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1075:
--

Status: Patch Available  (was: Open)

> Error in Cogroup when key fields types don't match
> --
>
> Key: PIG-1075
> URL: https://issues.apache.org/jira/browse/PIG-1075
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Ankur
>Assignee: Richard Ding
> Attachments: PIG-1075.patch
>
>
> When Cogrouping 2 relations on multiple key fields, pig throws an error if 
> the corresponding types don't match. 
> Consider the following script:-
> A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int);
> B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int);
> C = CoGROUP A BY (a,b,c), B BY (a,b,c);
> D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
> describe D;
> dump D;
> The complete stack trace of the error thrown is
> Pig Stack Trace
> ---
> ERROR 1051: Cannot cast to Unknown
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to 
> describe schema for alias D
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:436)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An 
> unexpected exception caused the validation to stop
> at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
> at 
> org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
> at org.apache.pig.PigServer.compileLp(PigServer.java:821)
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:428)
> ... 6 more
> Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
> ERROR 1060: Cannot resolve COGroup output schema
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463)
> at 
> org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372)
> at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
> ... 11 more
> Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
> ERROR 1051: Cannot cast to Unknown
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451)
> ... 16 more
> The error message does not help the user in identifying the issue clearly 
> especially if the pig script is large and complex.




[jira] Assigned: (PIG-1075) Error in Cogroup when key fields types don't match

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1075:
-

Assignee: Richard Ding

> Error in Cogroup when key fields types don't match
> --
>
> Key: PIG-1075
> URL: https://issues.apache.org/jira/browse/PIG-1075
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Ankur
>Assignee: Richard Ding
> Attachments: PIG-1075.patch
>
>
> When Cogrouping 2 relations on multiple key fields, pig throws an error if 
> the corresponding types don't match. 
> Consider the following script:-
> A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int);
> B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int);
> C = CoGROUP A BY (a,b,c), B BY (a,b,c);
> D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
> describe D;
> dump D;
> The complete stack trace of the error thrown is
> Pig Stack Trace
> ---
> ERROR 1051: Cannot cast to Unknown
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to 
> describe schema for alias D
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:436)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An 
> unexpected exception caused the validation to stop
> at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
> at 
> org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
> at org.apache.pig.PigServer.compileLp(PigServer.java:821)
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:428)
> ... 6 more
> Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
> ERROR 1060: Cannot resolve COGroup output schema
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463)
> at 
> org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372)
> at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
> ... 11 more
> Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
> ERROR 1051: Cannot cast to Unknown
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451)
> ... 16 more
> The error message does not help the user in identifying the issue clearly 
> especially if the pig script is large and complex.




[jira] Updated: (PIG-1075) Error in Cogroup when key fields types don't match

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1075:
--

Attachment: PIG-1075.patch

This patch moves the error up to the parser and gives a better error message 
for cogroup statement with incompatible group types:

Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1110: 
Cogroup column 1 has incompatible types: chararray versus int
at 
org.apache.pig.impl.logicalLayer.LOCogroup.getTupleGroupBySchema(LOCogroup.java:499)
at 
org.apache.pig.impl.logicalLayer.LOCogroup.getSchema(LOCogroup.java:325)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:779)
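
The kind of early check the patch describes can be sketched as below. This is 
a hypothetical, simplified illustration and not the code in PIG-1075.patch: 
types are plain strings, any difference is treated as incompatible (the sketch 
ignores any implicit casts Pig may allow), and the class name CogroupKeyCheck 
is invented. The message format mirrors the one quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: compare cogroup key types column by column and
// fail early with a message that names the offending column.
final class CogroupKeyCheck {
    static void check(List<String> leftTypes, List<String> rightTypes) {
        if (leftTypes.size() != rightTypes.size()) {
            throw new IllegalArgumentException("Cogroup inputs have different key arity");
        }
        for (int i = 0; i < leftTypes.size(); i++) {
            String l = leftTypes.get(i);
            String r = rightTypes.get(i);
            if (!l.equals(r)) {
                throw new IllegalArgumentException(
                    "Cogroup column " + i + " has incompatible types: " + l + " versus " + r);
            }
        }
    }

    public static void main(String[] args) {
        // Key types of A and B from the script in the issue description.
        List<String> a = Arrays.asList("chararray", "int", "int");
        List<String> b = Arrays.asList("chararray", "chararray", "int");
        try {
            check(a, b);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```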

> Error in Cogroup when key fields types don't match
> --
>
> Key: PIG-1075
> URL: https://issues.apache.org/jira/browse/PIG-1075
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Ankur
> Attachments: PIG-1075.patch
>
>
> When Cogrouping 2 relations on multiple key fields, pig throws an error if 
> the corresponding types don't match. 
> Consider the following script:-
> A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int);
> B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int);
> C = CoGROUP A BY (a,b,c), B BY (a,b,c);
> D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
> describe D;
> dump D;
> The complete stack trace of the error thrown is
> Pig Stack Trace
> ---
> ERROR 1051: Cannot cast to Unknown
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to 
> describe schema for alias D
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:436)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An 
> unexpected exception caused the validation to stop
> at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
> at 
> org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
> at org.apache.pig.PigServer.compileLp(PigServer.java:821)
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:428)
> ... 6 more
> Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
> ERROR 1060: Cannot resolve COGroup output schema
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463)
> at 
> org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372)
> at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
> ... 11 more
> Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
> ERROR 1051: Cannot cast to Unknown
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552)
> at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451)
> ... 16 more
> The error message does not help the user in identifying the issue clearly 
> especially if the pig script is large and complex.




[jira] Updated: (PIG-1151) Date Conversion + Arithmetic UDFs

2009-12-14 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated PIG-1151:
--

Summary: Date Conversion + Arithmetic UDFs  (was: Data Conversion + 
Arithmetic UDFs)

> Date Conversion + Arithmetic UDFs
> -
>
> Key: PIG-1151
> URL: https://issues.apache.org/jira/browse/PIG-1151
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
>Reporter: sam rash
>Priority: Minor
>
> I would like to offer up some very simple date UDFs I have that wrap JodaTime 
> (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and 
> operate on ISO8601 date strings (for piggybank). Please advise if these are 
> appropriate.
> 1. Date Arithmetic
> takes an input string: 
> 2009-01-01T13:43:33.000Z
> (and partial ones such as 2009-01-02)
> and a timespan (as millis or as string shorthand)
> returns an ISO8601 string that adjusts the input date by the specified 
> timespan
> DatePlus(long timeMs); // + or - number works, is the # of millis
> DatePlus(String timespan); //10m = 10 minutes, 1h = 1 hour, 1172 ms, etc
> DateMinus(String timespan); //propose explicit minus when using string 
> shorthand for time periods
> 2. Date Comparison (when you don't have full strings that you can use string 
> compare with):
> DateIsBefore(String dateString); //true if lhs is before rhs
> DateIsAfter(String dateString); //true if lhs is after rhs
> 3. date trunc functions:
> takes partial ISO8601 strings and truncates to:
> toMinute(String dateString);
> toHour(String dateString);
> toDay(String dateString);
> toWeek(String dateString);
> toMonth(String dateString);
> toYear(String dateString);
> if any/all are helpful, I'm happy to contribute to pig
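
A rough sketch of what a few of the proposed functions could compute. The 
proposal wraps JodaTime; this example swaps in the JDK's java.time package so 
that it stays self-contained, and it is not a Pig UDF. The class name 
DateSketch and the two-argument signatures are assumptions, and partial dates 
such as 2009-01-02 are not handled here because Instant.parse needs a full 
timestamp.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the proposed date helpers using java.time.
final class DateSketch {
    // Shifts a full ISO8601 timestamp by a signed number of milliseconds.
    static String datePlus(String isoDate, long timeMs) {
        return Instant.parse(isoDate).plusMillis(timeMs).toString();
    }

    // True if lhs is strictly before rhs.
    static boolean dateIsBefore(String lhs, String rhs) {
        return Instant.parse(lhs).isBefore(Instant.parse(rhs));
    }

    // Truncates a timestamp to the start of its hour (UTC).
    static String toHour(String isoDate) {
        return Instant.parse(isoDate).truncatedTo(ChronoUnit.HOURS).toString();
    }

    public static void main(String[] args) {
        String d = "2009-01-01T13:43:33.000Z";
        System.out.println(datePlus(d, 10 * 60 * 1000L)); // ten minutes later
        System.out.println(toHour(d));
    }
}
```

Note that Instant.toString() omits a zero fractional part, so the output form 
may differ slightly from the input form while denoting the same instant.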




[jira] Commented: (PIG-973) type resolution inconsistency

2009-12-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790494#action_12790494
 ] 

Olga Natkovich commented on PIG-973:


+1 on the changes. I will be committing the patch shortly

> type resolution inconsistency
> -
>
> Key: PIG-973
> URL: https://issues.apache.org/jira/browse/PIG-973
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Attachments: PIG-973.patch
>
>
> This script works:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>describe F;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
> dump C;
> This one produces an error:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G);}
> dump C;
> Notice the difference in how MIN is passed the data.




[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1082:


Fix Version/s: (was: 0.0.0)

> Modify Comparator to work with a typed textual Storage
> --
>
> Key: PIG-1082
> URL: https://issues.apache.org/jira/browse/PIG-1082
> Project: Pig
>  Issue Type: Sub-task
>Affects Versions: 0.4.0
>Reporter: hc busy
> Attachments: PIG-1082.patch
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> See parent bug. This ticket is for just the comparator change, which needs to 
> be made in order for the nested data structures to sort right




[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790492#action_12790492
 ] 

Thejas M Nair commented on PIG-965:
---

Reviewed the latest patch.
Comments :
* RegexInit.java, in determineBestRegexMethod Line 85 - 120
There are while loops where we are testing only for preceding '\'
The handling of preceding escapes could be done in a separate function, since 
the logic is used at multiple places

* RegexInit.java lines 61,147
// This is the case when an old number of escapes 
I believe you meant "odd"
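
The separate function suggested for the escape handling might look something 
like the sketch below. The names EscapeUtil, precedingBackslashes and isEscaped 
are hypothetical and do not come from the patch; the idea is only that a 
character is escaped when an odd number of backslashes immediately precedes it.

```java
// Hypothetical helper: centralizes the "preceding backslashes" logic.
final class EscapeUtil {
    // Counts consecutive backslashes immediately before index pos.
    static int precedingBackslashes(String s, int pos) {
        int count = 0;
        for (int i = pos - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            count++;
        }
        return count;
    }

    // An odd number of preceding backslashes means the character at pos
    // is escaped; an even number means the backslashes escape each other.
    static boolean isEscaped(String s, int pos) {
        return precedingBackslashes(s, pos) % 2 == 1;
    }
}
```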

* RegexImpl.java - following comments are not relevant anymore
+// LHS means LHS is constantExpression and RHS varies with each Tuple
+// RHS means RHS is constantExpression and LHS varies with each Tuple

* NonConstantRegex , line 34-35
{code}
|| rhs.length() != oldString.length()
|| rhs.compareTo(oldString) != 0
{code}
could be simplified as -
{code}
|| !rhs.equals(oldString)
{code}
Did you choose the former because it might be faster? That can be the case in 
this situation, because equals has an additional "instanceof String" check. 
So I think the existing code is fine. A comment there might be useful.
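
To make the comparison concrete, here is a minimal sketch of the two forms side by side. The class and method names are hypothetical and are not taken from the patch; only the two boolean expressions come from the snippets above. For non-null strings both forms should give the same answer, so the choice is about speed and readability only.

```java
// Hypothetical helper, for illustration only; not part of the patch.
class RhsChangeCheck {
    // Form used in the patch: length test first, then compareTo.
    static boolean changedVerbose(String rhs, String oldString) {
        return rhs.length() != oldString.length()
                || rhs.compareTo(oldString) != 0;
    }

    // Shorter form suggested in the review.
    static boolean changedSimple(String rhs, String oldString) {
        return !rhs.equals(oldString);
    }
}
```

If the longer form is kept, a one-line comment next to it saying that it is intentional would help later readers.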


Can you also publish your numbers for the comparison of 
dk.brics.automaton.RunAutomaton and optimization 2 (Use string comparisons for 
simple common regexes ) in the jira ?



> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Thejas M Nair
>Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
> has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive like clause uses similar optimizations.
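
The two optimizations in the description can be sketched roughly as follows. This is not the code from poregex2.patch: the class name CachedMatcher, its fields, and the compileCount counter are made up for illustration, and the sketch assumes Java regex syntax (".*" as the wildcard) rather than the '%' shorthand used in the description. The fast path is skipped when the input contains a line terminator, because "." does not match those by default and plain string comparison would then give a different answer.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: cache the compiled pattern and use plain string
// comparisons for literal prefix / suffix / substring patterns.
class CachedMatcher {
    private static final String META = "\\[](){}.*+?^$|";
    private String lastPattern;   // pattern string seen on the previous slow-path call
    private Pattern compiled;     // compiled form of lastPattern
    int compileCount = 0;         // exposed only so the caching can be observed

    static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (META.indexOf(s.charAt(i)) >= 0) return false;
        }
        return true;
    }

    static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == 0x85 || c == 0x2028 || c == 0x2029) return true;
        }
        return false;
    }

    // Returns null when the pattern is not one of the simple shapes.
    static Boolean fastPath(String input, String pattern) {
        if (hasLineTerminator(input)) return null;
        boolean lead = pattern.startsWith(".*");
        boolean trail = pattern.endsWith(".*");
        int start = lead ? 2 : 0;
        int end = trail ? pattern.length() - 2 : pattern.length();
        if (start > end) return null;            // the pattern is just ".*"
        String body = pattern.substring(start, end);
        if (!isLiteral(body)) return null;       // real regex, use the slow path
        if (lead && trail) return input.contains(body);
        if (lead) return input.endsWith(body);
        if (trail) return input.startsWith(body);
        return input.equals(body);
    }

    boolean matches(String input, String pattern) {
        Boolean fast = fastPath(input, pattern);
        if (fast != null) return fast;
        if (!pattern.equals(lastPattern)) {      // recompile only when the rhs changed
            compiled = Pattern.compile(pattern);
            lastPattern = pattern;
            compileCount++;
        }
        return compiled.matcher(input).matches();
    }
}
```

Any pattern containing a backslash is sent to the slow path here, which sidesteps the escape-counting logic discussed in the review at the cost of missing some patterns that could be optimized.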




[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Status: Patch Available  (was: Open)

> Poisson Sample Loader should compute the number of samples required only once
> -
>
> Key: PIG-1143
> URL: https://issues.apache.org/jira/browse/PIG-1143
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Sriranjan Manjunath
> Attachments: PIG_1143.patch
>
>
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.




[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Attachment: PIG_1143.patch

> Poisson Sample Loader should compute the number of samples required only once
> -
>
> Key: PIG-1143
> URL: https://issues.apache.org/jira/browse/PIG-1143
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Sriranjan Manjunath
> Attachments: PIG_1143.patch
>
>
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.




[jira] Updated: (PIG-1054) Pig Site - updates for 5.0

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1054:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

The changes have already been applied.

> Pig Site - updates for 5.0
> --
>
> Key: PIG-1054
> URL: https://issues.apache.org/jira/browse/PIG-1054
> Project: Pig
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 0.5.0
>Reporter: Corinne Chandel
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: pig-1054.patch
>
>
> Pig Site - updates for 5.0
> > remove broken link
> > update formatting for headers




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1016:


Fix Version/s: (was: 0.5.0)
   0.7.0

> Reading in map data seems broken
> 
>
> Key: PIG-1016
> URL: https://issues.apache.org/jira/browse/PIG-1016
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.4.0
>Reporter: hc busy
> Fix For: 0.7.0
>
> Attachments: PIG-1016.patch
>
>
> Hi, I'm trying to load a map that has a tuple for value. The read fails in 
> 0.4.0 because of a misconfiguration in the parser, whereas almost all of the 
> documentation states that the value of the map can be any type.
> I've attached a patch that allows us to read in complex objects as value as 
> documented. I've done simple verification of loading in maps with tuple/map 
> values and writing them back out using LOAD and STORE. All seems to work fine.




[jira] Updated: (PIG-1106) FR join should not spill

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1106:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed. Thanks, Ankit!

> FR join should not spill
> 
>
> Key: PIG-1106
> URL: https://issues.apache.org/jira/browse/PIG-1106
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Ankit Modi
> Fix For: 0.7.0
>
> Attachments: frjoin-nonspill.patch
>
>
> Currently, the values for the replicated side of the data are placed in a 
> spillable bag (POFRJoin near line 275). This does not make sense because the 
> whole point of the optimization is that the data on one side fits into 
> memory. We already have a non-spillable bag implemented 
> (NonSpillableDataBag.java) and we need to change FRJoin code to use it. And 
> of course need to do lots of testing to make sure that we don't spill but die 
> instead when we run out of memory




[jira] Updated: (PIG-1151) Data Conversion + Arithmetic UDFs

2009-12-14 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated PIG-1151:
--

Priority: Minor  (was: Major)

> Data Conversion + Arithmetic UDFs
> -
>
> Key: PIG-1151
> URL: https://issues.apache.org/jira/browse/PIG-1151
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
>Reporter: sam rash
>Priority: Minor
>
> I would like to offer up some very simple data UDFs I have that wrap JodaTime 
> (apache 2.0 license, http://joda-time.sourceforge.net/license.html) and 
> operate on ISO8601 date strings.
> (for piggybank).  Please advise if these are appropriate.
> 1. Date Arithmetic
> takes an input string: 
> 2009-01-01T13:43:33.000Z
> (and partial ones such as 2009-01-02)
> and a timespan (as millis or as string shorthand)
> returns an ISO8601 string that adjusts the input date by the specified 
> timespan
> DatePlus(long timeMs); // + or - number works, is the # of millis
> DatePlus(String timespan); //10m = 10 minutes, 1h = 1 hour, 1172 ms, etc
> DateMinus(String timespan); //propose explicit minus when using string 
> shorthand for time periods
> 2. Date Comparison (when you don't have full strings that you can use string 
> compare with):
> DateIsBefore(String dateString); //true if lhs is before rhs
> DateIsAfter(String dateString); //true if lhs is after rhs
> 3. date trunc functions:
> takes partial ISO8601 strings and truncates to:
> toMinute(String dateString);
> toHour(String dateString);
> toDay(String dateString);
> toWeek(String dateString);
> toMonth(String dateString);
> toYear(String dateString);
> if any/all are helpful, I'm happy to contribute to pig




[jira] Created: (PIG-1151) Data Conversion + Arithmetic UDFs

2009-12-14 Thread sam rash (JIRA)
Data Conversion + Arithmetic UDFs
-

 Key: PIG-1151
 URL: https://issues.apache.org/jira/browse/PIG-1151
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
Reporter: sam rash


I would like to offer up some very simple data UDFs I have that wrap JodaTime 
(apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate 
on ISO8601 date strings.
(for piggybank).  Please advise if these are appropriate.

1. Date Arithmetic

takes an input string: 

2009-01-01T13:43:33.000Z
(and partial ones such as 2009-01-02)

and a timespan (as millis or as string shorthand)

returns an ISO8601 string that adjusts the input date by the specified timespan

DatePlus(long timeMs); // + or - number works, is the # of millis
DatePlus(String timespan); //10m = 10 minutes, 1h = 1 hour, 1172 ms, etc
DateMinus(String timespan); //propose explicit minus when using string 
shorthand for time periods

2. Date Comparison (when you don't have full strings that you can use string 
compare with):

DateIsBefore(String dateString); //true if lhs is before rhs
DateIsAfter(String dateString); //true if lhs is after rhs

3. date trunc functions:

takes partial ISO8601 strings and truncates to:

toMinute(String dateString);
toHour(String dateString);
toDay(String dateString);
toWeek(String dateString);
toMonth(String dateString);
toYear(String dateString);

if any/all are helpful, I'm happy to contribute to pig
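
As a rough illustration of what these functions could look like, here is a sketch written against the JDK's java.time package instead of JodaTime, so it is not the proposed implementation. The class name IsoDateSketch, the two-argument static method shapes, the set of shorthand units (ms, s, m, h, d), and the choice to read a date-only string as midnight UTC are all assumptions made for the example.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the date arithmetic, comparison and truncation ideas.
class IsoDateSketch {
    private static final Pattern SPAN = Pattern.compile("(\\d+)\\s*(ms|s|m|h|d)");

    // Accepts a full instant (2009-01-01T13:43:33.000Z) or a date only (2009-01-02).
    static Instant parse(String iso) {
        if (iso.indexOf('T') < 0) {
            // Assumption: a date without a time means midnight UTC.
            return LocalDate.parse(iso).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
        return Instant.parse(iso);
    }

    // Converts shorthand such as "10m", "1h" or "1172 ms" to milliseconds.
    static long toMillis(String timespan) {
        Matcher m = SPAN.matcher(timespan.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("bad timespan: " + timespan);
        }
        long n = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "ms": return n;
            case "s":  return n * 1_000L;
            case "m":  return n * 60_000L;
            case "h":  return n * 3_600_000L;
            default:   return n * 86_400_000L; // "d"
        }
    }

    static String datePlus(String iso, long timeMs) {
        return parse(iso).plusMillis(timeMs).toString();
    }

    static String datePlus(String iso, String timespan) {
        return datePlus(iso, toMillis(timespan));
    }

    static String dateMinus(String iso, String timespan) {
        return datePlus(iso, -toMillis(timespan));
    }

    static boolean dateIsBefore(String lhs, String rhs) {
        return parse(lhs).isBefore(parse(rhs));
    }

    static String toHour(String iso) {
        return parse(iso).truncatedTo(ChronoUnit.HOURS).toString();
    }
}
```

Comparing parsed instants rather than raw strings is what lets dateIsBefore handle a partial date on one side and a full timestamp on the other.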




[jira] Updated: (PIG-1150) VAR() Variance UDF

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1150:


Fix Version/s: (was: 0.5.0)
   0.7.0

Updating the fix version since it will go into the future version and will not 
be backported

> VAR() Variance UDF
> --
>
> Key: PIG-1150
> URL: https://issues.apache.org/jira/browse/PIG-1150
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
> Environment: UDF, written in Pig 0.5 contrib/
>Reporter: Russell Jurney
> Fix For: 0.7.0
>
>
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
> variance in a distributed manner, based on the AVG() builtin.  It works by 
> calculating the count, sum and sum of squares, as described here: 
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value 
> using the contrib SQRT() function gives Standard Deviation, which is missing 
> from Pig.




[jira] Commented: (PIG-1150) VAR() Variance UDF

2009-12-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790403#action_12790403
 ] 

Olga Natkovich commented on PIG-1150:
-

Yes, it is definitely worthwhile to contribute!

> VAR() Variance UDF
> --
>
> Key: PIG-1150
> URL: https://issues.apache.org/jira/browse/PIG-1150
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
> Environment: UDF, written in Pig 0.5 contrib/
>Reporter: Russell Jurney
> Fix For: 0.7.0
>
>
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
> variance in a distributed manner, based on the AVG() builtin.  It works by 
> calculating the count, sum and sum of squares, as described here: 
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value 
> using the contrib SQRT() function gives Standard Deviation, which is missing 
> from Pig.




[jira] Commented: (PIG-973) type resolution inconsistency

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790402#action_12790402
 ] 

Hadoop QA commented on PIG-973:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427943/PIG-973.patch
  against trunk revision 889870.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 7 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/console

This message is automatically generated.

> type resolution inconsistency
> -
>
> Key: PIG-973
> URL: https://issues.apache.org/jira/browse/PIG-973
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Attachments: PIG-973.patch
>
>
> This script works:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>describe F;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
> dump C;
> This one produces an error:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G);}
> dump C;
> Notice the difference in how MIN is passed the data.




[jira] Created: (PIG-1150) VAR() Variance UDF

2009-12-14 Thread Russell Jurney (JIRA)
VAR() Variance UDF
--

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
 Fix For: 0.5.0


I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
variance in a distributed manner, based on the AVG() builtin.  It works by 
calculating the count, sum and sum of squares, as described here: 
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Is this a worthwhile contribution?  Taking the square root of this value using 
the contrib SQRT() function gives Standard Deviation, which is missing from Pig.
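
For readers who want to see the count / sum / sum-of-squares idea spelled out, here is a small standalone sketch. It is not the UDF itself and does not use Pig's Algebraic interface; the class VarianceSketch and its method names are hypothetical. Each partial state is the triple {count, sum, sum of squares}, partial states are merged by element-wise addition, and the final step computes the population variance as E[x^2] - (E[x])^2.

```java
// Hypothetical sketch of a variance computed from mergeable partial states.
class VarianceSketch {
    // Builds the partial state {count, sum, sumOfSquares} for one chunk of data.
    static double[] initial(double[] values) {
        double[] state = new double[3];
        for (double v : values) {
            state[0] += 1;
            state[1] += v;
            state[2] += v * v;
        }
        return state;
    }

    // Merges two partial states; this is what makes the computation distributable.
    static double[] combine(double[] a, double[] b) {
        return new double[] { a[0] + b[0], a[1] + b[1], a[2] + b[2] };
    }

    // Population variance from the merged state.
    static double variance(double[] state) {
        double mean = state[1] / state[0];
        return state[2] / state[0] - mean * mean;
    }

    static double stdDev(double[] state) {
        return Math.sqrt(variance(state));
    }
}
```

This formula is simple to merge but can lose precision when the mean is large relative to the spread of the values, which is worth keeping in mind when testing the UDF.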




[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790387#action_12790387
 ] 

Pradeep Kamath commented on PIG-1090:
-

I committed the latest patch - thanks Richard! Looks like I committed it while 
the above discussion was still on - if things need to be changed, please attach 
a small patch for the same and I can commit it - if we decide to keep things 
the way they are now that's fine.

> Update sources to reflect recent changes in load-store interfaces
> -
>
> Key: PIG-1090
> URL: https://issues.apache.org/jira/browse/PIG-1090
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch
>
>
> There have been some changes (as recorded in the Changes Section, Nov 2 2009 
> sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the 
> load/store interfaces - this jira is to track the task of making those 
> changes under src. Changes under test will be addressed in a different jira.




[jira] Created: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)
Allow instantiation of SampleLoaders with parametrized LoadFuncs


 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0


Currently, it is not possible to instantiate a SampleLoader with something like 
PigStorage(':').  We should allow passing parameters to the loaders being 
sampled.




[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-14 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790362#action_12790362
 ] 

Richard Ding commented on PIG-1110:
---

Hi Jeff, I think it's a good idea to ask users to specify their intention in 
the PigStorage constructor (instead of using file extensions). The issue with 
this approach, however, is that the arguments to PigStorage constructors can 
only be Strings, so Pig determines the meanings of the arguments by their 
positions. Therefore we want to consider carefully what other arguments may 
need to be added to the constructor in the future and what their positions 
would be.

As for forcing users to add .bz2 as the extension of the output files, this is 
actually necessary since Hadoop LineRecordReader (used internally by 
PigStorage) finds the relevant compression codec for the given file based on 
its filename suffix. So for now users must specify .bz2 as the extension of the 
output files if they want to store the files as BZip files.
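
The suffix-based lookup can be pictured with a toy version like the one below. It is not Hadoop code: the class SuffixCodecLookup and its table are made up for this example, and the values are just the simple names of Hadoop's GzipCodec and BZip2Codec used as labels. It only shows why a file without the .bz2 extension would not be recognized as BZip.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of choosing a codec from the filename suffix.
class SuffixCodecLookup {
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "GzipCodec");
        CODECS.put(".bz2", "BZip2Codec");
    }

    // Returns the codec label for the file, or null if no suffix matches.
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```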

> Handle compressed file formats -- Gz, BZip with the new proposal
> 
>
> Key: PIG-1110
> URL: https://issues.apache.org/jira/browse/PIG-1110
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Richard Ding
>Assignee: Richard Ding
> Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
>
>





[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790356#action_12790356
 ] 

Dmitriy V. Ryaboy commented on PIG-1090:


Richard, 
I added the getters/setters so that ResourceSchema can be treated as a POJO, 
and standard serialization tools can easily interact with it, in PIG-760. 
Alan said in PIG-760 that he is fine with adding getters and setters, but feels 
strongly that direct access to these members should still be allowed, for 
simplicity's sake.
I'm fine with the visibility being either way, as long as the getters/setters 
stay in (although perhaps protected would be a better choice than private).


> Update sources to reflect recent changes in load-store interfaces
> -
>
> Key: PIG-1090
> URL: https://issues.apache.org/jira/browse/PIG-1090
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch
>
>
> There have been some changes (as recorded in the Changes Section, Nov 2 2009 
> sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the 
> load/store interfaces - this jira is to track the task of making those 
> changes under src. Changes under test will be addressed in a different jira.




[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790334#action_12790334
 ] 

Richard Ding commented on PIG-1090:
---

The problem is that the getters/setters for the internal members are also 
defined for ResourceSchema. I felt that we should choose one way to access the 
internal members.  

> Update sources to reflect recent changes in load-store interfaces
> -
>
> Key: PIG-1090
> URL: https://issues.apache.org/jira/browse/PIG-1090
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch
>
>
> There have been some changes (as recorded in the Changes Section, Nov 2 2009 
> sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the 
> load/store interfaces - this jira is to track the task of making those 
> changes under src. Changes under test will be addressed in a different jira.




[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790313#action_12790313
 ] 

Dmitriy V. Ryaboy commented on PIG-1090:


I thought Alan wanted to keep the internal state of ResourceSchema public?
This patch changes the visibility of internal members to private. 

> Update sources to reflect recent changes in load-store interfaces
> -
>
> Key: PIG-1090
> URL: https://issues.apache.org/jira/browse/PIG-1090
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch
>
>
> There have been some changes (as recorded in the Changes Section, Nov 2 2009 
> sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the 
> load/store interfaces - this jira is to track the task of making those 
> changes under src. Changes under test will be addressed in a different jira.




[jira] Commented: (PIG-1148) Move splitable logic from pig latin to InputFormat

2009-12-14 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790273#action_12790273
 ] 

Pradeep Kamath commented on PIG-1148:
-

Hi Jeff,
  With the new load store redesign 
(http://wiki.apache.org/pig/LoadStoreRedesignProposal) wouldn't this be 
achieved implicitly, since the splits used by pig will be the ones returned from 
the InputFormat associated with the Loader? The plan was to remove SPLIT by 
'file' from the language since with the new load-store design it will not be 
possible to support this from pig. So there will be no splitable logic left 
with that approach. Were you thinking of some other way to support split by 
file?

> Move splitable logic from pig latin to InputFormat
> --
>
> Key: PIG-1148
> URL: https://issues.apache.org/jira/browse/PIG-1148
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>





[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-973:
-

Status: Patch Available  (was: Open)

> type resolution inconsistency
> -
>
> Key: PIG-973
> URL: https://issues.apache.org/jira/browse/PIG-973
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Attachments: PIG-973.patch
>
>
> This script works:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>describe F;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
> dump C;
> This one produces an error:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G);}
> dump C;
> Notice the difference in how MIN is passed the data.




[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-973:
-

Attachment: PIG-973.patch

> type resolution inconsistency
> -
>
> Key: PIG-973
> URL: https://issues.apache.org/jira/browse/PIG-973
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Attachments: PIG-973.patch
>
>
> This script works:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>describe F;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
> dump C;
> This one produces an error:
> A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: 
> float);
> B = group A by age;
> C = foreach B {
>D = filter A by gpa > 2.5;
>E = order A by name;
>F = A.age;
>G = distinct F;
>generate group, COUNT(D), MAX (E.name), MIN(G);}
> dump C;
> Notice the difference in how MIN is passed the data.




[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790104#action_12790104
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427913/poregex2.patch
  against trunk revision 889870.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/testReport/
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/console

This message is automatically generated.

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Thejas M Nair
>Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
> has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive like clause uses similar optimizations.




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Patch Available  (was: Open)

I have included changes suggested by Thejas.

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Thejas M Nair
>Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
> has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive like clause uses similar optimizations.




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: (was: poregex2.patch)

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
> Reporter: Thejas M Nair
> Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) once and re-use it if the pattern 
> string has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of Hive's LIKE clause uses similar optimizations.




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Open  (was: Patch Available)

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
> Reporter: Thejas M Nair
> Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) once and re-use it if the pattern 
> string has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of Hive's LIKE clause uses similar optimizations.




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: poregex2.patch

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
> Reporter: Thejas M Nair
> Assignee: Ankit Modi
> Attachments: automaton.jar, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the 
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) once and re-use it if the pattern 
> string has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of Hive's LIKE clause uses similar optimizations.
