[jira] Commented: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug
[ https://issues.apache.org/jira/browse/PIG-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915860#action_12915860 ] Olga Natkovich commented on PIG-1652: - I think the code needs to be modified to default to 1 if we can't perform the computation TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug Key: PIG-1652 URL: https://issues.apache.org/jira/browse/PIG-1652 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to the input size estimation. Here is the stack of TestSortedTableUnionMergeJoin: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias records3 at org.apache.pig.PigServer.storeEx(PigServer.java:877) at org.apache.pig.PigServer.store(PigServer.java:815) at org.apache.pig.PigServer.openIterator(PigServer.java:727) at org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution. 
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197) at org.apache.pig.PigServer.storeEx(PigServer.java:873) Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 69: org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file: at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.init(Path.java:126) at org.apache.hadoop.fs.Path.init(Path.java:50) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301) Caused by: java.net.URISyntaxException: Illegal
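The comment on PIG-1652 suggests defaulting to 1 when the computation cannot be performed. A minimal Python sketch of that idea follows; it is not Pig's Java code. The function name, the `get_size` callback and the two default limits are illustrative assumptions.

```python
import math

def estimate_number_of_reducers(input_paths, get_size,
                                bytes_per_reducer=10**9, max_reducers=999):
    """Estimate reducers from total input size; fall back to 1 on failure.

    get_size is a hypothetical callback returning the size in bytes of one
    input location. Both default limits are illustrative values only.
    """
    try:
        total = sum(get_size(p) for p in input_paths)
    except Exception:
        # The size could not be computed, e.g. the location string is not a
        # valid file system path. Default to a single reducer instead of
        # failing the whole compilation.
        return 1
    estimate = math.ceil(total / bytes_per_reducer)
    return max(1, min(estimate, max_reducers))
```

With this shape, a loader whose location string is not a plain path (as in the Zebra tests above) would degrade to one reducer rather than surface an unexpected execution error.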
[jira] Assigned: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug
[ https://issues.apache.org/jira/browse/PIG-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1652: --- Assignee: Thejas M Nair TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug Key: PIG-1652 URL: https://issues.apache.org/jira/browse/PIG-1652 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to the input size estimation. Here is the stack of TestSortedTableUnionMergeJoin: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias records3 at org.apache.pig.PigServer.storeEx(PigServer.java:877) at org.apache.pig.PigServer.store(PigServer.java:815) at org.apache.pig.PigServer.openIterator(PigServer.java:727) at org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution. 
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197) at org.apache.pig.PigServer.storeEx(PigServer.java:873) Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 69: org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file: at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.init(Path.java:126) at org.apache.hadoop.fs.Path.init(Path.java:50) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301) Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 69:
[jira] Resolved: (PIG-1646) Error meassage for pig root directory does not existcab be more meaningful
[ https://issues.apache.org/jira/browse/PIG-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-1646.
Resolution: Invalid

This ticket is for a particular deployment scenario; it has nothing to do with core Pig functionality.

Error meassage for pig root directory does not existcab be more meaningful
Key: PIG-1646
URL: https://issues.apache.org/jira/browse/PIG-1646
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Sherry Chen
Priority: Minor

Currently, the error message shown when the pig root directory does not exist is:
* You suppose to use /grid/0/gs/pig/0.8 as pig root directory, however, symlink /grid/0/gs/pig/0.8 does not exist
It can be corrected to:
* Pig root directory should be /grid/0/gs/pig/0.8, however, symlink /grid/0/gs/pig/0.8 does not exist

Steps to test:
1. Submit a pig job: pig -useversion 0.8 -exectype local local.pig
2. Read the error message

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1600) Pig 080 Documentation
[ https://issues.apache.org/jira/browse/PIG-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914731#action_12914731 ]

Olga Natkovich commented on PIG-1600:

Patch committed to trunk and the 0.8 branch. Thanks, Corinne!

Pig 080 Documentation
Key: PIG-1600
URL: https://issues.apache.org/jira/browse/PIG-1600
Project: Pig
Issue Type: Task
Components: documentation
Affects Versions: 0.8.0
Reporter: Corinne Chandel
Assignee: Corinne Chandel
Priority: Blocker
Fix For: 0.8.0
Attachments: pig080-1.patch, pig080-2-2.patch, pig080-2.patch, pig080-3.patch

Pig 080 documentation: new features, updates, and fixes.
[jira] Resolved: (PIG-1504) need to document new functions moved from piggybank to builtin
[ https://issues.apache.org/jira/browse/PIG-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-1504.
Resolution: Fixed

need to document new functions moved from piggybank to builtin
Key: PIG-1504
URL: https://issues.apache.org/jira/browse/PIG-1504
Project: Pig
Issue Type: Improvement
Components: documentation
Reporter: Olga Natkovich
Assignee: Corinne Chandel
Fix For: 0.8.0

We need to document the following new functions:

ABS ACOS ASIN ATAN CBRT CEIL COR COSH COS COV EXP FLOOR INDEXOF LAST_INDEX_OF LCFIRST LOG10 LOG LOWER RANDOM REGEX_EXTRACT_ALL REGEX_EXTRACT REPLACE ROUND SINH SIN SPLIT SQRT SUBSTRING TANH TAN TOBAG TOP TOTUPLE TRIM UCFIRST UPPER

Most of them are math functions, and their descriptions can be found here: http://download.oracle.com/docs/cd/E17409_01/javase/7/docs/api/java/lang/Math.html
For the rest, we would need to provide descriptions.
[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1632:
Status: Resolved (was: Patch Available)
Resolution: Fixed

The core jar in the tarball contains the kitchen sink
Key: PIG-1632
URL: https://issues.apache.org/jira/browse/PIG-1632
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Assignee: Eli Collins
Fix For: site, 0.9.0
Attachments: pig-1632-1.patch, pig-1632-2.patch

The core jar in the tarball contains the kitchen sink; it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath.
{noformat}
pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
12
pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
...
pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
4819
{noformat}
How about restricting the core jar to just Pig classes?
[jira] Commented: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913733#action_12913733 ]

Olga Natkovich commented on PIG-1632:

Hi Eli, thanks for the patch. I don't think this is the approach we want to take. I think we should publish just the core pig jar in maven, since users there have a way to pull the dependencies. However, as part of our release package we should include the bundled pig.jar so that it works for users out of the box and they get exactly the version we have been testing. I am fine with additionally including the core jar if we do not do this already.

The core jar in the tarball contains the kitchen sink
Key: PIG-1632
URL: https://issues.apache.org/jira/browse/PIG-1632
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Fix For: site, 0.9.0
Attachments: pig-1632-1.patch

The core jar in the tarball contains the kitchen sink; it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath.
{noformat}
pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
12
pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
...
pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
4819
{noformat}
How about restricting the core jar to just Pig classes?
[jira] Commented: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913743#action_12913743 ]

Olga Natkovich commented on PIG-1632:

I am fine with your second proposal, which is what I also suggested in my last comment. The first one makes it harder for users to compile their UDFs.

The core jar in the tarball contains the kitchen sink
Key: PIG-1632
URL: https://issues.apache.org/jira/browse/PIG-1632
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Fix For: site, 0.9.0
Attachments: pig-1632-1.patch

The core jar in the tarball contains the kitchen sink; it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath.
{noformat}
pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
12
pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
...
pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
4819
{noformat}
How about restricting the core jar to just Pig classes?
[jira] Commented: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913759#action_12913759 ]

Olga Natkovich commented on PIG-1632:

+1, patch looks good. I will commit it to trunk and the 0.8 branch shortly.

The core jar in the tarball contains the kitchen sink
Key: PIG-1632
URL: https://issues.apache.org/jira/browse/PIG-1632
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Fix For: site, 0.9.0
Attachments: pig-1632-1.patch, pig-1632-2.patch

The core jar in the tarball contains the kitchen sink; it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath.
{noformat}
pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
12
pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
...
pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
4819
{noformat}
How about restricting the core jar to just Pig classes?
[jira] Commented: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913788#action_12913788 ]

Olga Natkovich commented on PIG-1632:

Patch committed to both the 0.8 branch and trunk. Thanks, Eli, for contributing!

The core jar in the tarball contains the kitchen sink
Key: PIG-1632
URL: https://issues.apache.org/jira/browse/PIG-1632
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Fix For: site, 0.9.0
Attachments: pig-1632-1.patch, pig-1632-2.patch

The core jar in the tarball contains the kitchen sink; it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath.
{noformat}
pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
12
pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
...
pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
4819
{noformat}
How about restricting the core jar to just Pig classes?
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1635:
Fix Version/s: 0.8.0

Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed
Key: PIG-1635
URL: https://issues.apache.org/jira/browse/PIG-1635
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
Fix For: 0.8.0

b = FILTER a by (( f1 1) AND (1 == 1))
or
b = FILTER a by ((f1 1) OR ( 1==0))
should be simplified to
b = FILTER a by f1 1;

Regarding the ordering change, an example is
b = filter a by ((f1 is not null) AND (f2 is not null));
Even without any possible simplification, the expression is changed to
b = filter a by ((f2 is not null) AND (f1 is not null));

The ordering change makes no difference in this case, and probably in most other cases, but some users might care about the ordering for two reasons: stateful UDFs may be used as operands of AND or OR, and the application designer may have chosen the ordering to maximize the chance of short-circuiting the composite boolean evaluation.
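The behaviour PIG-1635 asks for, removing constants under AND and OR without reordering the remaining operands, can be sketched in Python as follows. This is an illustration under stated assumptions, not Pig's logical simplifier: the tuple-based expression encoding and the `simplify` name are hypothetical, and constant comparisons such as (1 == 1) are assumed to be already folded to a boolean constant.

```python
def simplify(expr):
    """Remove boolean constants under AND/OR while keeping operand order.

    Hypothetical encoding: ("and", l, r), ("or", l, r), ("const", bool),
    and any other value is treated as an opaque leaf predicate.
    """
    if not isinstance(expr, tuple) or expr[0] not in ("and", "or"):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    identity = ("const", op == "and")    # TRUE for AND, FALSE for OR
    absorbing = ("const", op != "and")   # FALSE for AND, TRUE for OR
    if left == identity:
        return right
    if right == identity:
        return left
    if left == absorbing:
        # Short-circuit evaluation would never reach `right`, so dropping
        # it cannot skip a stateful UDF call that would otherwise run.
        return absorbing
    # Operands are returned in their original left-to-right order.
    return (op, left, right)
```

For example, `simplify(("and", "f1_pred", ("const", True)))` returns `"f1_pred"`, and two non-constant operands come back in the order they were written.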
[jira] Updated: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1339: Fix Version/s: 0.9.0 We should see if the new parser makes this easier and if so fix it. International characters in column names not supported -- Key: PIG-1339 URL: https://issues.apache.org/jira/browse/PIG-1339 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0, 0.8.0 Reporter: Viraj Bhat Fix For: 0.9.0 There is a particular use-case in which someone specifies a column name to be in International characters. {code} inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); describe inputdata; dump inputdata; {code} == Pig Stack Trace --- ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. Encountered: \u3042 (12354), after : at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:391) == Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1640) bin/pig does not run in local mode due to classes missing from classpath
bin/pig does not run in local mode due to classes missing from classpath
Key: PIG-1640
URL: https://issues.apache.org/jira/browse/PIG-1640
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Fix For: 0.8.0

This issue was reported by one of the Yahoo users. I have not verified the problem. Here is the report:

When I run bin/pig -x local, the shell doesn't come up. It complains that jline cannot be found. Here is a patch to bin/pig:

+for f in $PIG_HOME/build/ivy/lib/Pig/*.jar; do
+CLASSPATH=${CLASSPATH}:$f;
+done
+
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1639:
Assignee: Xuefu Zhang (was: Daniel Dai)

New logical plan: PushUpFilter should not optimize if filter condition contains UDF
Key: PIG-1639
URL: https://issues.apache.org/jira/browse/PIG-1639
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
Fix For: 0.8.0

The following script fails:
{code}
a = load 'file' AS (f1, f2, f3);
b = group a by f1;
c = filter b by COUNT(a) 1;
dump c;
{code}
[jira] Commented: (PIG-1617) 'group all' should always use one reducer
[ https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912655#action_12912655 ]

Olga Natkovich commented on PIG-1617:

Looks good. +1

'group all' should always use one reducer
Key: PIG-1617
URL: https://issues.apache.org/jira/browse/PIG-1617
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: PIG-1617.1.patch

'group all' sends all rows to a single reducer, so it does not make sense to spawn more than one reducer for it. But if a higher value of parallelism is specified, or if the input is large enough that the changes in PIG-1249 result in a larger value being set, additional reducers are spawned that don't do anything useful.
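The rule PIG-1617 describes can be sketched in a few lines of Python. This is a hypothetical illustration, not the committed patch; the function and parameter names are assumptions.

```python
def reducers_for_group(is_group_all, requested_parallel=None, estimated=1):
    """Pick a reducer count for a GROUP operation (illustrative only).

    'group all' routes every row to one key, so extra reducers would sit
    idle. A requested PARALLEL value and a size-based estimate are both
    ignored in that case.
    """
    if is_group_all:
        return 1
    if requested_parallel is not None and requested_parallel > 0:
        return requested_parallel
    return max(1, estimated)
```

The check for 'group all' comes first on purpose, so that neither an explicit parallelism setting nor a large input estimate can raise the count above one.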
[jira] Assigned: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1579: --- Assignee: Daniel Dai Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput --- Key: PIG-1579 URL: https://issues.apache.org/jira/browse/PIG-1579 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1579-1.patch Error message: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function: Traceback (most recent call last): File iostream, line 5, in multStr TypeError: can't multiply sequence by non-int of type 'NoneType' at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1624) FOREACH AS documentation is incorrect
[ https://issues.apache.org/jira/browse/PIG-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1624:
Fix Version/s: 0.8.0 (was: 0.9.0)

We are still updating docs, so we should be able to get this in for 0.8.

FOREACH AS documentation is incorrect
Key: PIG-1624
URL: https://issues.apache.org/jira/browse/PIG-1624
Project: Pig
Issue Type: Bug
Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Corinne Chandel
Fix For: 0.8.0

According to the Pig Latin manual (http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#FOREACH) the correct usage of AS in a FOREACH clause is:
{code}
B = foreach A generate $0, $1, $2 as (user, age, gpa);
{code}
However, this is incorrect and produces a syntax error. The correct syntax for AS in FOREACH is:
{code}
B = foreach A generate $0 as user, $1 as age, $2 as gpa;
{code}
[jira] Created: (PIG-1626) Need to clarify how COUNT handles nulls
Need to clarify how COUNT handles nulls
Key: PIG-1626
URL: https://issues.apache.org/jira/browse/PIG-1626
Project: Pig
Issue Type: Bug
Components: documentation
Reporter: Olga Natkovich
Assignee: Corinne Chandel
Fix For: 0.8.0

The current documentation just states:

The COUNT function ignores NULL values. If you want to include NULL values in the count computation, use COUNT_STAR.

The new text should be something like:

The COUNT function follows syntax semantics and ignores nulls. What this means is that a tuple in the bag will not be counted if the first field in this tuple is NULL. If you want to include NULL values in the count computation, use COUNT_STAR.
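The difference described in the proposed text can be illustrated with a small Python model, in which a bag is a list of tuples and NULL is `None`. This is only a sketch of the stated semantics, not Pig's implementation; treating an empty tuple as not counted is an assumption.

```python
def count(bag):
    """COUNT as described above: skip tuples whose first field is NULL."""
    return sum(1 for t in bag if len(t) > 0 and t[0] is not None)

def count_star(bag):
    """COUNT_STAR: count every tuple, NULL fields included."""
    return sum(1 for _ in bag)

bag = [(1, "a"), (None, "b"), (3, None)]
print(count(bag), count_star(bag))
```

Only the first field matters to `count`: the tuple `(3, None)` is counted even though its second field is NULL, while `(None, "b")` is not.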
[jira] Created: (PIG-1629) Need ability to limit bags produced during GROUP + LIMIT
Need ability to limit bags produced during GROUP + LIMIT
Key: PIG-1629
URL: https://issues.apache.org/jira/browse/PIG-1629
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Thejas M Nair
Fix For: 0.9.0

Currently, the code below will construct the full group in memory and then trim it. This uses more memory than needed.

A = load 'data' as (x, y, z);
B = group A by x;
C = foreach B { D = limit A 100; generate group, MyUDF(D); }
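The memory concern in PIG-1629 comes down to where the limit is applied. Below is a minimal Python sketch, assuming the rows of one group arrive as an iterator; the `limit_bag` name is hypothetical and this is not how Pig builds bags internally.

```python
from itertools import islice

def limit_bag(rows, limit):
    """Keep the first `limit` rows, pulling no more than that from `rows`.

    Applying the limit while the bag is being filled means the rest of the
    group is never materialized, unlike building the full bag and trimming.
    """
    return list(islice(rows, limit))
```

With a generator as input, only `limit` rows are ever produced, so memory use is bounded by the limit rather than by the size of the group.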
[jira] Resolved: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
[ https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-1615.
Resolution: Fixed

Return code from Pig is 0 even if the job fails when using -M flag
Key: PIG-1615
URL: https://issues.apache.org/jira/browse/PIG-1615
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat
Fix For: 0.8.0

I have a Pig script of this form, which I use inside a workflow system such as Oozie.
{code}
A = load '$INPUT' using PigStorage();
store A into '$OUTPUT';
{code}
I run this with multi-query optimization turned off:
{quote}
$java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
{quote}
The directory /user/viraj/junk1 is not present, so I get the following results:
{quote}
Input(s): Failed to read data from /user/viraj/junk1
Output(s): Failed to produce result in /user/viraj/junk2
{quote}
This is expected, but the return code is still 0:
{code}
$ echo $?
0
{code}
If I run this script with multi-query optimization turned on, it gives a return code of 2, which is correct.
{code}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
...
$ echo $?
2
{code}
I believe the wrong return code from Pig is causing Oozie to believe that the Pig script succeeded.

Viraj
[jira] Assigned: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error
[ https://issues.apache.org/jira/browse/PIG-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1247: --- Assignee: Xuefu Zhang Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error - Key: PIG-1247 URL: https://issues.apache.org/jira/browse/PIG-1247 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Xuefu Zhang Fix For: 0.9.0 I have a large script with intermediate store statements, one of which writes to a directory I do not have permission to write to. The stack trace I get from Pig is this: 2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error Details at logfile: /home/viraj/pig_1266632145355.log Pig Stack Trace --- ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error java.lang.ClassCastException: org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986) at org.apache.pig.PigServer.registerQuery(PigServer.java:386) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:386) The only way to find the error was to look at the javacc generated QueryParser.java code and do a System.out.println() Here is a script to reproduce the problem: {code} A = load '/user/viraj/three.txt' using PigStorage(); B = foreach A generate ['a'#'12'] as b:map[] ; store B into '/user/secure/pigtest' using PigStorage(); {code} three.txt has 3 lines which contain nothing but the number 1. {code} $ hadoop fs -ls /user/secure/ ls: could not get get listing for 'hdfs://mynamenode/user/secure' : org.apache.hadoop.security.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode=secure:secure:users:rwx-- {code} Viraj -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key
[ https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1592: --- Assignee: Thejas M Nair ORDER BY distribution is uneven when record size is correlated with order key - Key: PIG-1592 URL: https://issues.apache.org/jira/browse/PIG-1592 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Assignee: Thejas M Nair Fix For: 0.9.0 The partitioner contributed in PIG-545 distributes the order key space between partitions so that each partition gets approximately the same number of keys, even when the keys have a non-uniform distribution over the key space. Unfortunately this still allows for severe partition imbalance when record size is correlated with the order key. By way of motivating example, consider this script, which attempts to produce a list of genuses ordered by how many species each genus contains: {code} set default_parallel 60; critters = load 'biodata' as (genus, species); genus_counts = foreach (group critters by genus) generate group as genus, COUNT(critters) as num_species, critters; ordered_genuses = order genus_counts by num_species desc; store ordered_genuses {code} The higher the value of num_species, the more species tuples the critters bag contains, and the wider the row. This can cause a severe processing imbalance: the partition holding the records with the highest values of num_species will have the same number of *records* as the partition holding the lowest values, but it will have far more actual *bytes* to work on.
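The imbalance can be shown with a toy calculation. The Python sketch below is not the PIG-545 partitioner; `split_by_weight`, `load_per_partition`, and the sample row sizes are hypothetical. It splits an already sorted list of records into partitions, once weighting every record equally and once weighting each record by its size in bytes.

```python
def split_by_weight(weights, num_partitions):
    """Assign consecutive sorted records to partitions so that each
    partition receives roughly the same total weight."""
    target = sum(weights) / num_partitions
    assignment = []
    partition, running = 0, 0.0
    for w in weights:
        # Move on once the current partition has reached its share.
        if running >= target and partition < num_partitions - 1:
            partition += 1
            running = 0.0
        assignment.append(partition)
        running += w
    return assignment

def load_per_partition(assignment, sizes, num_partitions):
    """Sum the byte sizes that land in each partition."""
    loads = [0] * num_partitions
    for p, size in zip(assignment, sizes):
        loads[p] += size
    return loads

# Hypothetical rows sorted by num_species descending; bytes track the key.
sizes = [800, 100, 50, 50]
by_records = split_by_weight([1] * len(sizes), 2)
by_bytes = split_by_weight(sizes, 2)
record_loads = load_per_partition(by_records, sizes, 2)
byte_loads = load_per_partition(by_bytes, sizes, 2)
```

In this example equal record counts give byte loads of 900 and 100, while weighting by size gives 800 and 200. A single very wide row still limits how even the split can get.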
[jira] Commented: (PIG-1606) flatten documentation does not discuss flatten of empty bag
[ https://issues.apache.org/jira/browse/PIG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909810#action_12909810 ] Olga Natkovich commented on PIG-1606: - If we are not planning to change the semantics, I will ask Corinne to document this for 0.8 flatten documentation does not discuss flatten of empty bag --- Key: PIG-1606 URL: https://issues.apache.org/jira/browse/PIG-1606 Project: Pig Issue Type: Bug Components: documentation Reporter: Thejas M Nair Fix For: 0.9.0 From the existing flatten documentation, it is not clear that flatten of an empty bag results in that row being discarded. For example, the following query gives no output: {code} grunt> cat /tmp/empty.bag {} 1 grunt> l = load '/tmp/empty.bag' as (b : bag{}, i : int); grunt> f = foreach l generate flatten(b), i; grunt> dump f; grunt> {code}
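The behavior reported here follows if flatten is read as "one output row per tuple in the bag". The Python sketch below models only that reading; `flatten_bag` is a hypothetical helper, not Pig code.

```python
def flatten_bag(bag, other_fields):
    """Emit one output row per tuple in `bag`, each followed by `other_fields`.

    An empty bag yields no rows, so the input row disappears.
    """
    return [tuple(item) + tuple(other_fields) for item in bag]

non_empty = flatten_bag([("x",), ("y",)], (1,))
empty = flatten_bag([], (1,))
```

Under this model the row `({}, 1)` from the example produces nothing, which is why `dump f` prints no output.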
[jira] Updated: (PIG-1606) flatten documentation does not discuss flatten of empty bag
[ https://issues.apache.org/jira/browse/PIG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1606: Assignee: Corinne Chandel Fix Version/s: 0.8.0 (was: 0.9.0) flatten documentation does not discuss flatten of empty bag --- Key: PIG-1606 URL: https://issues.apache.org/jira/browse/PIG-1606 Project: Pig Issue Type: Bug Components: documentation Reporter: Thejas M Nair Assignee: Corinne Chandel Fix For: 0.8.0 From the existing flatten documentation, it is not clear that flatten of an empty bag results in that row being discarded. For example, the following query gives no output: {code} grunt> cat /tmp/empty.bag {} 1 grunt> l = load '/tmp/empty.bag' as (b : bag{}, i : int); grunt> f = foreach l generate flatten(b), i; grunt> dump f; grunt> {code}
[jira] Created: (PIG-1613) Explain how different UDF interfaces are used
Explain how different UDF interfaces are used - Key: PIG-1613 URL: https://issues.apache.org/jira/browse/PIG-1613 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.7.0 Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0 The current documentation describes individual UDF interfaces such as Algebraic and Accumulator, but not their precedence, how they interact with each other, or why you might want to implement several of them. Corinne, I will add release notes to this JIRA shortly. Don't worry about it till then.
[jira] Updated: (PIG-1613) Explain how different UDF interfaces are used
[ https://issues.apache.org/jira/browse/PIG-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1613: Release Note: I think this should go into Advanced Topics in the UDF manual. There are multiple ways for a UDF to be invoked. The simplest UDF can just extend EvalFunc, which requires only the exec function to be implemented, as described in the How to Write a Simple Eval Function section. Every eval UDF must implement this. Additionally, if a function is algebraic, it can implement the Algebraic interface to significantly improve query performance in cases where the combiner can be used. The Aggregate Functions section covers this topic in detail. Finally, a function that can process tuples incrementally can also implement the Accumulator interface to reduce query memory consumption. The Accumulator interface section explains this interface. The exact method by which a UDF is invoked is selected by the optimizer based on the UDF type and the query. Note that only a single interface is used at any given time. The optimizer tries to find the most efficient way to execute the function. If a combiner is used and the function implements the Algebraic interface, then that interface is used to invoke the function. If the combiner is not invoked but the accumulator can be used and the function implements the Accumulator interface, then that interface is used. If neither condition is satisfied, then the exec function is used to invoke the UDF. Can one of the developers review the release notes to make sure they are accurate? Thanks. 
Explain how different UDF interfaces are used - Key: PIG-1613 URL: https://issues.apache.org/jira/browse/PIG-1613 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.7.0 Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0 The current documentation describes individual UDF interfaces such as Algebraic and Accumulator, but not their precedence, how they interact with each other, or why you might want to implement several of them. Corinne, I will add release notes to this JIRA shortly. Don't worry about it till then.
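The precedence described in the release note can be written down as a short decision function. This Python sketch only restates the note's wording; it is not the optimizer's code, and `choose_invocation` and its boolean parameters are hypothetical names.

```python
def choose_invocation(implements_algebraic, implements_accumulator,
                      combiner_used, accumulator_usable):
    """Return which interface would be used, in the order the note gives."""
    if combiner_used and implements_algebraic:
        return "Algebraic"
    if accumulator_usable and implements_accumulator:
        return "Accumulator"
    # Every eval UDF implements exec, so it is always available as a fallback.
    return "exec"

both_with_combiner = choose_invocation(True, True, True, True)
both_without_combiner = choose_invocation(True, True, False, True)
plain_udf = choose_invocation(False, False, False, False)
```

This is also why a UDF might implement several interfaces: each one covers a different query shape, and exec remains the fallback when neither optimization applies.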
[jira] Updated: (PIG-1578) PigServer.executeBatch does not return status of failed job for native mapreduce statement
[ https://issues.apache.org/jira/browse/PIG-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1578: Fix Version/s: (was: 0.8.0) PigServer.executeBatch does not return status of failed job for native mapreduce statement -- Key: PIG-1578 URL: https://issues.apache.org/jira/browse/PIG-1578 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Richard Ding For a failed job, PigServer.executeBatch does not return an ExecJob. ExecJobs are created using output statistics, and the output statistics for jobs that failed do not seem to exist. The query I tried was a native mapreduce job where the output file of the native MR job already exists, causing that job to fail. {code} A = load ' + INPUT_FILE + '; B = mapreduce ' + jarFileName + ' + Store A into 'table_testNativeMRJobSimple_input' + Load 'table_testNativeMRJobSimple_output' + `WordCount table_testNativeMRJobSimple_input + INPUT_FILE + `;); Store B into 'table_testNativeMRJobSimpleDir';); {code}
[jira] Resolved: (PIG-815) misleading error message when streaming fails
[ https://issues.apache.org/jira/browse/PIG-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-815. Resolution: Won't Fix I don't think we have sufficient information to act on this misleading error message when streaming fails - Key: PIG-815 URL: https://issues.apache.org/jira/browse/PIG-815 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Gunther Hagleitner Fix For: 0.9.0 One of the users reported seeing a confusing message: Jobs not found in the JobClient. Please try to use Local, Hadoop Distributed or Hadoop MiniCluster modes instead of Hadoop LocalExecution ERROR 2055: Received Error while processing the map plan: 'process.pl ' failed with exit status: 255 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-638) error handling - enforce error codes
[ https://issues.apache.org/jira/browse/PIG-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-638: --- Fix Version/s: (was: 0.9.0) error handling - enforce error codes Key: PIG-638 URL: https://issues.apache.org/jira/browse/PIG-638 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Santhosh Srinivasan We should not allow exceptions that don't set error code as that kind of information is not helpful for users. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1017: Assignee: Thejas M Nair (was: Sriranjan Manjunath) We need to decide if this is something we should do for 0.9 Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Thejas M Nair Fix For: 0.9.0 Attachments: stotext.patch Strings in Java are UTF-16 and take 2 bytes per char. Text (org.apache.hadoop.io.Text) stores the data in UTF-8 and could significantly reduce memory use.
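The size difference is easy to check for sample strings. This Python sketch compares encoded byte counts only; it ignores Java object headers and any overhead in Text, and `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(text):
    """Return (utf16_bytes, utf8_bytes) for `text`, without a byte-order mark."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))

ascii_sizes = encoded_sizes("pig latin")    # 9 ASCII characters
accented_sizes = encoded_sizes("caf\u00e9")  # 4 characters, one non-ASCII
```

For ASCII-heavy data UTF-8 takes half the bytes. The saving shrinks as the share of non-ASCII characters grows.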
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908984#action_12908984 ] Olga Natkovich commented on PIG-366: I think it used to use true local mode in Pig. However, we no longer support this, and the new version needs to be connected to the current local mode in Pig, which is basically Hadoop's local mode PigPen - Eclipse plugin for a graphical PigLatin editor --- Key: PIG-366 URL: https://issues.apache.org/jira/browse/PIG-366 Project: Pig Issue Type: New Feature Reporter: Shubham Chopra Assignee: Robert Gibbon Priority: Minor Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz This is an Eclipse plugin that provides a GUI to help users create PigLatin scripts, see the example generator outputs on the fly, and submit jobs to Hadoop clusters.
[jira] Commented: (PIG-1600) Pig 080 Documentation
[ https://issues.apache.org/jira/browse/PIG-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908139#action_12908139 ] Olga Natkovich commented on PIG-1600: - I have reviewed the patch and will be committing it to trunk and 0.7 branch as soon as I have a successful doc build. Thanks, Corinne! Pig 080 Documentation - Key: PIG-1600 URL: https://issues.apache.org/jira/browse/PIG-1600 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.8.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.8.0 Attachments: pig080-1.patch, pig080-2-2.patch, pig080-2.patch Pig 080 documentation - new features, updates, and fixes.
[jira] Commented: (PIG-1600) Pig 080 Documentation
[ https://issues.apache.org/jira/browse/PIG-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908150#action_12908150 ] Olga Natkovich commented on PIG-1600: - pig080-2-2.patch committed to both trunk and 0.8 branch Pig 080 Documentation - Key: PIG-1600 URL: https://issues.apache.org/jira/browse/PIG-1600 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.8.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.8.0 Attachments: pig080-1.patch, pig080-2-2.patch, pig080-2.patch Pig 080 documentation - new features, updates, and fixes.
[jira] Updated: (PIG-1606) flatten documentation does not discuss flatten of empty bag
[ https://issues.apache.org/jira/browse/PIG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1606: Fix Version/s: 0.9.0 flatten documentation does not discuss flatten of empty bag --- Key: PIG-1606 URL: https://issues.apache.org/jira/browse/PIG-1606 Project: Pig Issue Type: Bug Components: documentation Reporter: Thejas M Nair Fix For: 0.9.0 From the existing flatten documentation, it is not clear that flatten of an empty bag results in that row being discarded. For example, the following query gives no output: {code} grunt> cat /tmp/empty.bag {} 1 grunt> l = load '/tmp/empty.bag' as (b : bag{}, i : int); grunt> f = foreach l generate flatten(b), i; grunt> dump f; grunt> {code}
[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908249#action_12908249 ] Olga Natkovich commented on PIG-1608: - pig-default is the only one we include. The other one is for users. pig should always include pig-default.properties and pig.properties in the pig.jar -- Key: PIG-1608 URL: https://issues.apache.org/jira/browse/PIG-1608 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai pig should always include pig-default.properties and pig.properties as a part of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1606) flatten documentation does not discuss flatten of empty bag
[ https://issues.apache.org/jira/browse/PIG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908250#action_12908250 ] Olga Natkovich commented on PIG-1606: - Is this even the semantics we want? I would expect a single row with an empty field. flatten documentation does not discuss flatten of empty bag --- Key: PIG-1606 URL: https://issues.apache.org/jira/browse/PIG-1606 Project: Pig Issue Type: Bug Components: documentation Reporter: Thejas M Nair Fix For: 0.9.0 From the existing flatten documentation, it is not clear that flatten of an empty bag results in that row being discarded. For example, the following query gives no output: {code} grunt> cat /tmp/empty.bag {} 1 grunt> l = load '/tmp/empty.bag' as (b : bag{}, i : int); grunt> f = foreach l generate flatten(b), i; grunt> dump f; grunt> {code}
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907805#action_12907805 ] Olga Natkovich commented on PIG-1518: - Hi Justin, thanks for the patch! I don't think we can commit it to the 0.7 branch because we have already done the official 0.7 release and we can't introduce non-backward-compatible changes to this branch. However, I think it is great to have the patch on the JIRA so that anybody who is interested in it can apply it to their own tree and run with it. We have done similar things in the past (with Hadoop versions) and it worked fine. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0.
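The core idea of an umbrella input format is to pack several small files into one split so that fewer maps are started. The Python sketch below shows a simple greedy packing by size. It is an illustration under assumed inputs, not the attached patch or a Hadoop API; `combine_into_splits`, the file names, and the size cap are hypothetical.

```python
def combine_into_splits(file_sizes, max_split_bytes):
    """Greedily pack files, in order, into splits of at most `max_split_bytes`.

    A file larger than the cap gets a split of its own.
    """
    splits, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        # Close the current split if adding this file would exceed the cap.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits

# Hypothetical small input files as (name, size in bytes).
files = [("part-0", 10), ("part-1", 20), ("part-2", 40), ("part-3", 5)]
splits = combine_into_splits(files, max_split_bytes=64)
```

Here four files become two splits, so two maps would run instead of four. A real implementation would likely also weigh data locality, which this sketch ignores.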
[jira] Commented: (PIG-1600) Pig 080 Documentation
[ https://issues.apache.org/jira/browse/PIG-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906094#action_12906094 ] Olga Natkovich commented on PIG-1600: - patch committed to 0.8 branch; trunk is next Pig 080 Documentation - Key: PIG-1600 URL: https://issues.apache.org/jira/browse/PIG-1600 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.8.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Attachments: pig080-1.patch Pig 080 documentation - new features, updates, and fixes.
[jira] Commented: (PIG-1600) Pig 080 Documentation
[ https://issues.apache.org/jira/browse/PIG-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906107#action_12906107 ] Olga Natkovich commented on PIG-1600: - Patch committed to the trunk as well. Thanks, Corinne! Pig 080 Documentation - Key: PIG-1600 URL: https://issues.apache.org/jira/browse/PIG-1600 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.8.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.8.0 Attachments: pig080-1.patch Pig 080 documentation - new features, updates, and fixes.
[jira] Updated: (PIG-1600) Pig 080 Documentation
[ https://issues.apache.org/jira/browse/PIG-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1600: Fix Version/s: 0.8.0 Pig 080 Documentation - Key: PIG-1600 URL: https://issues.apache.org/jira/browse/PIG-1600 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.8.0 Reporter: Corinne Chandel Assignee: Corinne Chandel Priority: Blocker Fix For: 0.8.0 Attachments: pig080-1.patch Pig 080 documentation - new features, updates, and fixes.
[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it
[ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905628#action_12905628 ] Olga Natkovich commented on PIG-1544: - I am going to take my previous comment back and say that we should make this work for UDFs as well. The main reason is that we don't have another way to make sure that UDFs do not run out of memory. One approach that Alan proposed was to have bags ask for memory when they are created, with a central broker in charge of the memory pool. The details of this, and whether there is a better approach, still need to be thought through. proactive-spill bags should share the memory alloted for it --- Key: PIG-1544 URL: https://issues.apache.org/jira/browse/PIG-1544 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Initially, proactive spill bags were designed for use in (co)group (InternalCacheBag); they knew the total number of proactive bags that were present and shared the memory limit specified using the property pig.cachedbag.memusage. But the two proactive bag implementations that were added later, InternalDistinctBag and InternalSortedBag, are not aware of the actual number of bags being used; their users always assume total-numbags = 3. This needs to be fixed, and all proactive-spill bags should share the memory limit.
[jira] Updated: (PIG-1544) proactive-spill bags should share the memory alloted for it
[ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1544: Assignee: Thejas M Nair Fix Version/s: 0.9.0 proactive-spill bags should share the memory alloted for it --- Key: PIG-1544 URL: https://issues.apache.org/jira/browse/PIG-1544 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.9.0 Initially, proactive spill bags were designed for use in (co)group (InternalCacheBag); they knew the total number of proactive bags that were present and shared the memory limit specified using the property pig.cachedbag.memusage. But the two proactive bag implementations that were added later, InternalDistinctBag and InternalSortedBag, are not aware of the actual number of bags being used; their users always assume total-numbags = 3. This needs to be fixed, and all proactive-spill bags should share the memory limit.
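One way to make every bag type share a single limit is to have all bags register with a common object that knows how many are live. The Python sketch below illustrates that idea only; it is not Pig's design. `BagMemoryBroker` and its methods are hypothetical, and the budget of 1200 bytes is an arbitrary example value.

```python
class BagMemoryBroker:
    """Tracks live proactive-spill bags and divides one budget among them."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.live_bags = 0

    def register(self):
        """Called when a bag of any type is created."""
        self.live_bags += 1

    def unregister(self):
        """Called when a bag is released."""
        self.live_bags -= 1

    def limit_per_bag(self):
        # The divisor is the real bag count, not a fixed assumption such as 3.
        return self.total_bytes // max(self.live_bags, 1)

broker = BagMemoryBroker(total_bytes=1200)
for _ in range(3):
    broker.register()
limit_with_three = broker.limit_per_bag()
broker.register()
limit_with_four = broker.limit_per_bag()
```

When a fourth bag appears, each bag's share drops from 400 to 300 bytes. A fixed assumption of three bags would keep handing out 400 and overshoot the total.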
[jira] Updated: (PIG-1309) Sort Merge Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1309: Summary: Sort Merge Cogroup (was: Map-side Cogroup) Sort Merge Cogroup -- Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Fix For: 0.7.0, 0.8.0 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely on the map side in Pig. This JIRA is to add a map-side implementation of Cogroup in Pig. Details to follow.
[jira] Commented: (PIG-1550) better error handling in casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905717#action_12905717 ] Olga Natkovich commented on PIG-1550: - I will review the patch better error handling in casting relations to scalars - Key: PIG-1550 URL: https://issues.apache.org/jira/browse/PIG-1550 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1550.1.patch I ran the following script: Input data: joe 100 sam 20 bob 134 Script: A = load 'user_clicks' as (user: chararray, clicks: int); B = group A by user; C = foreach A generate group, SUM(A.clicks); D = foreach A generate clicks/(double)C.$1; dump C; Since C contains more than 1 tuple, I expected to get an error which I did. However, the error was not very clear. When the job failed, I did see a valid error (however it lacked the error code): 210630 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: Scalar has more than one row in the output However at the end of processing, I saw a misleading error: 210709 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2088: Unable to get results for: hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage 10/08/19 17:16:22 ERROR grunt.Grunt: ERROR 2088: Unable to get results for: hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1550) better error handling in casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905731#action_12905731 ] Olga Natkovich commented on PIG-1550: - +1, looks good better error handling in casting relations to scalars - Key: PIG-1550 URL: https://issues.apache.org/jira/browse/PIG-1550 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1550.1.patch I ran the following script: Input data: joe 100 sam 20 bob 134 Script: A = load 'user_clicks' as (user: chararray, clicks: int); B = group A by user; C = foreach A generate group, SUM(A.clicks); D = foreach A generate clicks/(double)C.$1; dump C; Since C contains more than 1 tuple, I expected to get an error which I did. However, the error was not very clear. When the job failed, I did see a valid error (however it lacked the error code): 210630 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: Scalar has more than one row in the output However at the end of processing, I saw a misleading error: 210709 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2088: Unable to get results for: hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage 10/08/19 17:16:22 ERROR grunt.Grunt: ERROR 2088: Unable to get results for: hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
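The check behind the first error can be sketched as a guard that runs before a relation is used as a scalar. This Python sketch is an illustration, not Pig's code. `ScalarError` and `relation_as_scalar` are hypothetical names, the sample rows are invented, and only the "more than one row" message text is taken from the log above.

```python
class ScalarError(Exception):
    """Raised when a relation used as a scalar does not hold exactly one row."""

def relation_as_scalar(rows, column):
    """Return the value at `column` if `rows` holds exactly one row."""
    if len(rows) > 1:
        raise ScalarError("Scalar has more than one row in the output")
    if not rows:
        raise ScalarError("Scalar has no rows in the output")
    return rows[0][column]

# One row is fine; three rows (one per user, as in the report) is an error.
single = relation_as_scalar([("all", 254)], 1)
c_rows = [("joe", 100), ("sam", 20), ("bob", 134)]
try:
    relation_as_scalar(c_rows, 1)
    failure = None
except ScalarError as err:
    failure = str(err)
```

Surfacing this one message as the final error, with an error code attached, would avoid the misleading ERROR 2088 described in the report.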
[jira] Updated: (PIG-1594) NullPointerException in new logical planner
[ https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1594: Assignee: Daniel Dai Fix Version/s: 0.8.0 NullPointerException in new logical planner --- Key: PIG-1594 URL: https://issues.apache.org/jira/browse/PIG-1594 Project: Pig Issue Type: Bug Reporter: Andrew Hitchcock Assignee: Daniel Dai Fix For: 0.8.0 I've been testing the trunk version of Pig on Elastic MapReduce against our log processing sample application(1). When I try to run the query it throws a NullPointerException and suggests I disable the new logical plan. Disabling it works and the script succeeds. Here is the query I'm trying to run: {code} register file:/home/hadoop/lib/pig/piggybank.jar DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT(); RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray); LOGS_BASE= foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] (.+?) (\\S+) (\\S+) ([^]*) ([^]*)')) as (remoteAddr:chararray, remoteLogname:chararray, user:chararray, time:chararray, request:chararray, status:int, bytes_string:chararray, referrer:chararray, browser:chararray); REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer; FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer matches '.*google.*'; SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, '.*[\\?]q=([^]+).*')) as terms:chararray; SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL; SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, COUNT($1) as num; SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50; STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT'; {code} And here is the stack trace that results: {code} ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false. org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in new logical plan. 
Try -Dpig.usenewlogicalplan=false. at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285) at org.apache.pig.PigServer.compilePp(PigServer.java:1301) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154) at org.apache.pig.PigServer.execute(PigServer.java:1148) at org.apache.pig.PigServer.access$100(PigServer.java:123) at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464) at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350) at org.apache.pig.PigServer.executeBatch(PigServer.java:324) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90) at org.apache.pig.Main.run(Main.java:491) at org.apache.pig.Main.main(Main.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.lang.NullPointerException at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76) at org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76) at org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111) at org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175) at org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143) at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55) at org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69) 
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87) at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74) at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:76) at
[jira] Updated: (PIG-1199) help includes obsolete options
[ https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1199: Release Note: Help now takes properties keyword to show all java properties supported by Pig: The following properties are supported: Logging: verbose=true|false; default is false. This property is the same as -v switch brief=true|false; default is false. This property is the same as -b switch debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch ... help includes obsolete options -- Key: PIG-1199 URL: https://issues.apache.org/jira/browse/PIG-1199 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 Attachments: PIG-1199.patch, PIG-1199_2.patch This is confusing to users -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1585) Add new properties to help and documentation
[ https://issues.apache.org/jira/browse/PIG-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905323#action_12905323 ] Olga Natkovich commented on PIG-1585: - Since this is just a minor cosmetic patch, I am just planning to commit the changes to both the branch and the trunk without tests and review. Add new properties to help and documentation Key: PIG-1585 URL: https://issues.apache.org/jira/browse/PIG-1585 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 Attachments: PIG-1585.patch New properties: Compression: pig.tmpfilecompression, default to false, tells if the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, PIG only accepts gz and lzo as possible values. Since LZO is under GPL license, Hadoop may need to be configured to use LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. Combining small files: pig.noSplitCombination - disables combining multiple small files to the block size -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1585) Add new properties to help and documentation
[ https://issues.apache.org/jira/browse/PIG-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1585: Attachment: PIG-1585.patch Add new properties to help and documentation Key: PIG-1585 URL: https://issues.apache.org/jira/browse/PIG-1585 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 Attachments: PIG-1585.patch New properties: Compression: pig.tmpfilecompression, default to false, tells if the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, PIG only accepts gz and lzo as possible values. Since LZO is under GPL license, Hadoop may need to be configured to use LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. Combining small files: pig.noSplitCombination - disables combining multiple small files to the block size -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1585) Add new properties to help and documentation
[ https://issues.apache.org/jira/browse/PIG-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1585. - Resolution: Fixed patch committed to both trunk and 0.8 branch. I also added LogicalExpressionSimplifier to the help Add new properties to help and documentation Key: PIG-1585 URL: https://issues.apache.org/jira/browse/PIG-1585 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 Attachments: PIG-1585.patch New properties: Compression: pig.tmpfilecompression, default to false, tells if the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, PIG only accepts gz and lzo as possible values. Since LZO is under GPL license, Hadoop may need to be configured to use LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. Combining small files: pig.noSplitCombination - disables combining multiple small files to the block size -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1429: Fix Version/s: (was: 0.8.0) Unlinking because we are branching for release today Add Boolean Data Type to Pig Key: PIG-1429 URL: https://issues.apache.org/jira/browse/PIG-1429 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.7.0 Reporter: Russell Jurney Assignee: Russell Jurney Attachments: working_boolean.patch Original Estimate: 8h Remaining Estimate: 8h Pig needs a Boolean data type. Pig-1097 is dependent on doing this. I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ plus unit tests to make this work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1549) Provide utility to construct CNF form of predicates
[ https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1549: Fix Version/s: (was: 0.8.0) Unlinking from 0.8 release since we are about to branch Provide utility to construct CNF form of predicates --- Key: PIG-1549 URL: https://issues.apache.org/jira/browse/PIG-1549 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Attachments: 0001-Add-CNF-utility-class.patch Provide utility to construct CNF form of predicates -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup
[ https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1530. - Resolution: Duplicate Xuefu is addressing this issue as part of https://issues.apache.org/jira/browse/PIG-1575. PIG Logical Optimization: Push LOFilter above LOCogroup Key: PIG-1530 URL: https://issues.apache.org/jira/browse/PIG-1530 Project: Pig Issue Type: New Feature Components: impl Reporter: Swati Jain Assignee: Swati Jain Priority: Minor Fix For: 0.8.0 Consider the following: {noformat} A = load 'any file' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'any file' USING PigStorage(',') as (b1:int,b2:int,b3:int); G = COGROUP A by (a1,a2) , B by (b1,b2); D = Filter G by group.$0 + 5 group.$1; explain D; {noformat} In the above example, LOFilter can be pushed above LOCogroup. Note there are some tricky NULL issues to think about when the Cogroup is not of type INNER (Similar to issues that need to be thought through when pushing LOFilter on the right side of a LeftOuterJoin). Also note that typically the LOFilter in user programs will be below a ForEach-Cogroup pair. To make this really useful, we need to also implement LOFilter pushed across ForEach. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1494: Unlinking from 0.8 since we are about to branch for release PIG Logical Optimization: Use CNF in PushUpFilter - Key: PIG-1494 URL: https://issues.apache.org/jira/browse/PIG-1494 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Swati Jain Assignee: Swati Jain Priority: Minor The PushUpFilter rule is not able to handle complicated boolean expressions. For example, SplitFilter rule is splitting one LOFilter into two by AND. However it will not be able to split LOFilter if the top level operator is OR. For example: *ex script:* A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); J1 = JOIN B by b1, C by c1; J2 = JOIN J1 by $0, A by a1; D = *Filter J2 by ( (c1 10) AND (a3+b3 10) ) OR (c2 == 5);* explain D; In the above example, the PushUpFilter is not able to push any filter condition across any join as it contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF) then we would be able to push filter condition c1 10 and c2 == 5 below both join conditions. Here is the CNF expression for highlighted line: ( (c1 10) OR (c2 == 5) ) AND ( (a3+b3 10) OR (c2 ==5) ) *Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF, it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also not require rule SplitFilter anymore if we were to add this utility to rule PushUpFilter itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
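The CNF rewrite suggested in PIG-1494 can be sketched outside of Pig. The following is a minimal Python illustration, not Pig's optimizer code: the tuple encoding and the `to_cnf` and `conjuncts` helpers are hypothetical, NOT is not handled, and the atoms are opaque placeholder strings because the comparison operators in the example script did not survive in this archive.

```python
# Expressions are nested tuples ("and", l, r) / ("or", l, r), or an atom string.
# Hypothetical encoding for illustration only; not Pig's LogicalExpression classes.

def to_cnf(expr):
    """Rewrite an AND/OR tree into conjunctive normal form."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left, right = to_cnf(left), to_cnf(right)
    if op == "and":
        return ("and", left, right)
    # op == "or": distribute OR over an AND found on either side
    if isinstance(left, tuple) and left[0] == "and":
        return ("and",
                to_cnf(("or", left[1], right)),
                to_cnf(("or", left[2], right)))
    if isinstance(right, tuple) and right[0] == "and":
        return ("and",
                to_cnf(("or", left, right[1])),
                to_cnf(("or", left, right[2])))
    return ("or", left, right)

def conjuncts(expr):
    """Split a CNF expression into its top-level AND-ed parts."""
    if isinstance(expr, tuple) and expr[0] == "and":
        return conjuncts(expr[1]) + conjuncts(expr[2])
    return [expr]

# Shape of the filter in the issue: (c1-condition AND a3+b3-condition) OR (c2 == 5)
filter_expr = ("or", ("and", "c1_cond", "a3b3_cond"), "c2_eq_5")
cnf = to_cnf(filter_expr)
parts = conjuncts(cnf)
```

Each element of `parts` is a conjunct that could be considered for push-down on its own. Here the first conjunct mentions only columns of C, which corresponds to the condition the issue says could be pushed.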
[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP
[ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904785#action_12904785 ]
Olga Natkovich commented on PIG-1506:
This is what we need to document:
In the case of GROUP/COGROUP, data with a NULL key from the same input is grouped together. For instance:
Input data (age is empty for sam and bob):
joe 5 2.5
sam 3.0
bob 3.5
Script: A = load 'small' as (name, age, gpa); B = group A by age; dump B;
Output: (5,{(joe,5,2.5)}) (,{(sam,,3.0),(bob,,3.5)})
Note that both records with null age are grouped together.
However, data with null keys from different inputs is considered different and will generate multiple tuples in the case of cogroup. For instance:
Input: Self cogroup on the same input.
Script: A = load 'small' as (name, age, gpa); B = load 'small' as (name, age, gpa); C = cogroup A by age, B by age; dump C;
Output: (5,{(joe,5,2.5)},{(joe,5,2.5)}) (,{(sam,,3.0),(bob,,3.5)},{}) (,{},{(sam,,3.0),(bob,,3.5)})
Note that there are 2 tuples in the output corresponding to the null key: one that contains tuples from the first input (with no match from the second) and one the other way around.
JOIN adds another interesting twist to this because it follows the SQL standard, which means that JOIN by default is an inner join, which throws away all the nulls.
Input: the same as for COGROUP.
Script: A = load 'small' as (name, age, gpa); B = load 'small' as (name, age, gpa); C = join A by age, B by age; dump C;
Output: (joe,5,2.5,joe,5,2.5)
Note that all tuples that had a NULL key got filtered out.
Need to clarify the difference between null handling in JOIN and COGROUP Key: PIG-1506 URL: https://issues.apache.org/jira/browse/PIG-1506 Project: Pig Issue Type: Improvement Components: documentation Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0
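The three behaviours described in the comment above can be mimicked with a small Python model. This is only a sketch of the documented semantics, not Pig code: `group`, `cogroup`, and `inner_join` are hypothetical helpers, `None` stands in for a Pig null, and the rows are the sample data from the comment.

```python
# None models a Pig null; each row is (name, age, gpa).
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]
AGE = 1  # index of the key column

def group(rel, key):
    """GROUP: rows with a null key from one input land in the same group."""
    out = {}
    for row in rel:
        out.setdefault(row[key], []).append(row)
    return out

def cogroup(a, b, key):
    """COGROUP: null keys from different inputs are never matched to each other."""
    ga, gb = group(a, key), group(b, key)
    keys = [k for k in dict.fromkeys(list(ga) + list(gb)) if k is not None]
    out = [(k, ga.get(k, []), gb.get(k, [])) for k in keys]
    if None in ga:
        out.append((None, ga[None], []))   # nulls of the first input only
    if None in gb:
        out.append((None, [], gb[None]))   # nulls of the second input only
    return out

def inner_join(a, b, key):
    """Inner JOIN: rows with a null key are dropped, as in SQL."""
    return [ra + rb for ra in a for rb in b
            if ra[key] is not None and ra[key] == rb[key]]

grouped = group(rows, AGE)
cogrouped = cogroup(rows, rows, AGE)
joined = inner_join(rows, rows, AGE)
```

With the sample rows, `grouped` has one null group holding sam and bob, `cogrouped` has two separate null tuples, and `joined` keeps only the joe row.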
[jira] Created: (PIG-1584) deal with inner cogroup
deal with inner cogroup --- Key: PIG-1584 URL: https://issues.apache.org/jira/browse/PIG-1584 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Fix For: 0.9.0 The current implementation of inner in the case of cogroup conflicts with join. We need to decide whether to fix inner cogroup or just remove the functionality if it is not widely used.
[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP
[ https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904829#action_12904829 ] Olga Natkovich commented on PIG-1506: - I verified that 0.8 code does deal correctly with multi-column keys with nulls Need to clarify the difference between null handling in JOIN and COGROUP Key: PIG-1506 URL: https://issues.apache.org/jira/browse/PIG-1506 Project: Pig Issue Type: Improvement Components: documentation Reporter: Olga Natkovich Assignee: Corinne Chandel Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1585) Add new properties to help and documentation
Add new properties to help and documentation Key: PIG-1585 URL: https://issues.apache.org/jira/browse/PIG-1585 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 New properties: Compression: pig.tmpfilecompression, default to false, tells if the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, PIG only accepts gz and lzo as possible values. Since LZO is under GPL license, Hadoop may need to be configured to use LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. Combining small files: pig.noSplitCombination - disables combining multiple small files to the block size -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
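The relationship between the two compression properties described above can be written down as a small check. This is a hypothetical client-side sketch in Python, assuming the properties arrive as a dict of strings; `check_tmpfile_compression` is not part of Pig, and Pig's own handling of invalid values is not described in this thread.

```python
# Codec values taken from the issue text: "PIG only accepts gz and lzo".
ALLOWED_CODECS = {"gz", "lzo"}

def check_tmpfile_compression(props):
    """Return a list of problems found in a dict of Pig property strings."""
    problems = []
    # pig.tmpfilecompression defaults to false, per the issue description.
    enabled = props.get("pig.tmpfilecompression", "false").lower() == "true"
    if not enabled:
        return problems
    codec = props.get("pig.tmpfilecompression.codec")
    if codec is None:
        problems.append("pig.tmpfilecompression.codec is not set")
    elif codec not in ALLOWED_CODECS:
        problems.append(f"unsupported codec: {codec}")
    return problems

ok = check_tmpfile_compression({
    "pig.tmpfilecompression": "true",
    "pig.tmpfilecompression.codec": "lzo",
})
bad = check_tmpfile_compression({
    "pig.tmpfilecompression": "true",
    "pig.tmpfilecompression.codec": "bzip2",
})
```

The codec is only consulted when compression is switched on, which mirrors the "If true, then ..." wording of the property description.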
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904848#action_12904848 ] Olga Natkovich commented on PIG-1501: - Ashutosh, the reason it is off by default is that the default compression is gzip, which is really slow and most of the time not what you want. Because of the licensing issue with lzo, users need to set it up on their own. Once they do the setup, they can enable the compression. need to investigate the impact of compression on pig performance Key: PIG-1501 URL: https://issues.apache.org/jira/browse/PIG-1501 Project: Pig Issue Type: Test Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.
[jira] Assigned: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)
[ https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1586: --- Assignee: Viraj Bhat Viraj volunteered to print the line that Pig gets as part of parameter substitution to see if the escapes and quotes are eaten by the shell. Thanks, Viraj. Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem) Key: PIG-1586 URL: https://issues.apache.org/jira/browse/PIG-1586 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Viraj Bhat Assignee: Viraj Bhat I have a Pig script as a template: {code} register Countwords.jar; A = $INPUT; B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO $OUTPUT; {code} I attempt parameter substitution using the following shell script: {code} #!/bin/bash java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \ -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \ -param OUTPUT=\'/user/viraj/output\' USING PigStorage() {code} {code} register Countwords.jar; A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,))); B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num; C = GROUP B BY $0; D = FOREACH C GENERATE group, SUM(B.num); STORE D INTO /user/viraj/output; {code} The shell substitutes the $0 before passing it to java. a) Is there a workaround for this? b) Is this a Pig param problem?
Viraj
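On the shell side of the question above, a POSIX shell does not expand `$0` inside single quotes, so one possible workaround is to single-quote each argument before it reaches the shell. The Python sketch below uses the standard `shlex` module to build such a command line. The jar path and the shortened parameter value are placeholders, and whether Pig's own parameter preprocessor then needs further escaping of `$0` is a separate question this sketch does not answer.

```python
import shlex

# Placeholder value: a shortened form of the INPUT parameter from the issue.
param_value = "CountWords($0,$1,$2)"

# Arguments as they should arrive in the JVM, one list element per argument.
argv = [
    "java", "-cp", "pig.jar", "org.apache.pig.Main",
    "-file", "sub.pig",
    "-param", f"INPUT={param_value}",
]

# shlex.quote wraps anything containing shell metacharacters in single quotes,
# which stops the shell from expanding $0, $1 and $2.
command_line = " ".join(shlex.quote(arg) for arg in argv)

# Splitting the quoted line with shell-like rules gives back the same arguments.
round_trip = shlex.split(command_line)
```

Printing `command_line` shows the form that could be pasted into a bash script; the `-param` argument appears as `'INPUT=CountWords($0,$1,$2)'`.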
[jira] Resolved: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)
[ https://issues.apache.org/jira/browse/PIG-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1588. - Resolution: Duplicate This is a duplicate of https://issues.apache.org/jira/browse/PIG-1586, and at this point we do not believe that either is a bug in Pig. Viraj is verifying that, but we think that the shell removes the escapes before giving the value to Pig. Parameter pre-processing of values containing pig positional variables ($0, $1 etc) --- Key: PIG-1588 URL: https://issues.apache.org/jira/browse/PIG-1588 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Laukik Chitnis Fix For: 0.7.0 Pig 0.7 requires the positional variables to be escaped by a \\ when passed as part of a parameter value (either through cmd line param or through param_file), which was not the case in Pig 0.6. Assuming that this was not an intended breakage of backward compatibility (could not find it in release notes), this would be a bug. For example, we need to pass INPUT=CountWords(\\$0,\\$1,\\$2) instead of simply INPUT=CountWords($0,$1,$2)
[jira] Resolved: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
[ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1537. - Resolution: Fixed Column pruner causes wrong results when using both Custom Store UDF and PigStorage -- Key: PIG-1537 URL: https://issues.apache.org/jira/browse/PIG-1537 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.8.0 I have a script of this pattern that uses 2 StoreFuncs: {code} register loader.jar register piggy-bank/java/build/storage.jar; %DEFAULT OUTPUTDIR /user/viraj/prunecol/ ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*'; ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c); ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*'; ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1; ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c; ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10; ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c; STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage(); ss_sc_all_map_count = group ss_sc_all_map all; count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1); STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009'); {code} I run this script using: a) java -cp pig0.7.jar script.pig b) java -cp pig0.7.jar -t PruneColumns script.pig What I observe is that the alias count produces the same number of records, but ss_sc_all_map has different sizes when run with the above 2 options. Is this due to the fact that 2 StoreFuncs are used?
Viraj
[jira] Updated: (PIG-747) Logical to Physical Plan Translation fails when temporary alias are created within foreach
[ https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-747: --- Fix Version/s: 0.9.0 (was: 0.8.0) Logical to Physical Plan Translation fails when temporary alias are created within foreach -- Key: PIG-747 URL: https://issues.apache.org/jira/browse/PIG-747 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.9.0 Attachments: physicalplan.txt, physicalplanprob.pig, PIG-747-1.patch Consider the Pig script below, which calculates a new column F inside the foreach: {code} A = load 'physicalplan.txt' as (col1,col2,col3); B = foreach A { D = col1/col2; E = col3/col2; F = E - (D*D); generate F as newcol; }; dump B; {code} This gives the following error: === Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException: ERROR 2015: Invalid physical operators in the physical plan at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377) at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63) at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29) at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122) at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246) ...
10 more Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide multiple outputs. This operator does not support multiple outputs. at org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373) ... 19 more === -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1319) New logical optimization rules
[ https://issues.apache.org/jira/browse/PIG-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1319: Fix Version/s: 0.9.0 (was: 0.8.0) New logical optimization rules -- Key: PIG-1319 URL: https://issues.apache.org/jira/browse/PIG-1319 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.9.0 In [PIG-1178|https://issues.apache.org/jira/browse/PIG-1178], we build a new logical optimization framework. One design goal for the new logical optimizer is to make it easier to add new logical optimization rules. In this Jira, we keep track of the development of these new logical optimization rules. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904450#action_12904450 ] Olga Natkovich commented on PIG-1563: - Dmitry, thanks for the review. I did not discard your function - it was part of the patch. I did not change the code to use it just because I already finished testing the changes and did not have time to redo the code. I am fixing some javadoc and release audit failures and will commit the code shortly. SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) Some string functions don't work with bytearray arguments
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904462#action_12904462 ] Olga Natkovich commented on PIG-1563: - +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 13 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] Some string functions don't work with bytearray arguments - Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) Some string functions don't work with bytearray arguments
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904467#action_12904467 ] Olga Natkovich commented on PIG-1563: - I made one additional change and renamed SPLIT into STRSPLIT to avoid conflict with SPLIT operator Some string functions don't work with bytearray arguments - Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1563) Some string functions don't work with bytearray arguments
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1563: Attachment: PIG_1563_v3.patch latest patch Some string functions don't work with bytearray arguments - Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch, PIG_1563_v3.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1563) Some string functions don't work with bytearray arguments
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1563: Status: Resolved (was: Patch Available) Resolution: Fixed patch committed. Thanks Dmitry for the help and review Some string functions don't work with bytearray arguments - Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch, PIG_1563_v3.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903578#action_12903578 ] Olga Natkovich commented on PIG-1563: - I was able to make it work (without wrapping) for the functions that have a fixed number of arguments: LAST_INDEX_OF, REPLACE, TRIM. I don't believe there is currently a way to make it work with a variable number of args (even if the number of combinations is fixed). Moreover, if we add the mapping table in this case, it breaks the case of typed data, which is bad. This is the case with the remaining functions, INDEXOF and SPLIT. So my suggestion is to fix only the first set of functions and delay the rest to 0.9, when we fix the mapping code. Dmitry and others, are you ok with this? If so, I can update the patch to reflect this. SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1502) Document and track system limits
[ https://issues.apache.org/jira/browse/PIG-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1502: Fix Version/s: 0.9.0 (was: 0.8.0) Document and track system limits Key: PIG-1502 URL: https://issues.apache.org/jira/browse/PIG-1502 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Fix For: 0.9.0 We need to be able to publish what the system limitations are to make sure that Pig is used in the way it was intended and tested. For instance, if you combine 30 joins in a single MR job (via multiquery) this might not work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903581#action_12903581 ] Olga Natkovich commented on PIG-1150: - Dmitry, are you planning to add unit tests? Do we still want this in for 0.8? (Since it is going into piggybank, we can do this post branching but then we need to test in 2 places.) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
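The count / sum / sum-of-squares approach described in the issue can be sketched roughly as follows. This is an illustration only, not the code in var.patch: the class name VariancePartial and its methods are hypothetical, and it computes the population variance. Each partial aggregate could be produced by a different map or combine task and merged later, which is what makes the calculation suitable for an Algebraic UDF.

```java
// Hypothetical sketch; names are illustrative and not taken from var.patch.
final class VariancePartial {
    final long count;     // number of values seen
    final double sum;     // sum of the values
    final double sumSq;   // sum of the squared values

    VariancePartial(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    /** Builds a partial aggregate from one chunk of the input. */
    static VariancePartial of(double... values) {
        double s = 0.0;
        double sq = 0.0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new VariancePartial(values.length, s, sq);
    }

    /** Combines two partial aggregates; the operation is associative and commutative. */
    VariancePartial merge(VariancePartial other) {
        return new VariancePartial(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    /** Population variance: mean of the squares minus the square of the mean. */
    double variance() {
        if (count == 0) {
            throw new IllegalStateException("variance of an empty input is undefined");
        }
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }
}
```

Taking Math.sqrt of the result gives the standard deviation mentioned in the issue. Note that this formula can lose precision when the mean is large relative to the spread of the values.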
[jira] Commented: (PIG-1549) Provide utility to construct CNF form of predicates
[ https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903591#action_12903591 ] Olga Natkovich commented on PIG-1549: - I don't think this patch applies. can you regenerate the patch with svn diff from the latest code and also add unit tests, thanks Provide utility to construct CNF form of predicates --- Key: PIG-1549 URL: https://issues.apache.org/jira/browse/PIG-1549 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Fix For: 0.8.0 Attachments: 0001-Add-CNF-utility-class.patch Provide utility to construct CNF form of predicates -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903593#action_12903593 ] Olga Natkovich commented on PIG-1494: - Can this be moved from 0.8 to 0.9 release since we are about to branch for 0.9? PIG Logical Optimization: Use CNF in PushUpFilter - Key: PIG-1494 URL: https://issues.apache.org/jira/browse/PIG-1494 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Swati Jain Assignee: Swati Jain Priority: Minor Fix For: 0.8.0
The PushUpFilter rule is not able to handle complicated boolean expressions. For example, the SplitFilter rule splits one LOFilter into two by AND. However, it cannot split an LOFilter whose top-level operator is OR. For example:
*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;
In the above example, PushUpFilter is not able to push any filter condition across any join, as the condition contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF), then we would be able to push filter condition c1 < 10 and c2 == 5 below both join conditions. Here is the CNF expression for the highlighted line:
( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
*Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF; it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also not require the SplitFilter rule anymore if we were to add this utility to the PushUpFilter rule itself.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
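The CNF conversion suggested in the issue amounts to distributing OR over AND. The sketch below is a minimal illustration under stated assumptions: the classes are hypothetical and are not Pig's logical expression API, predicates are opaque strings, and NOT is not handled. P stands for the comparison on c1, Q for the comparison on a3+b3, and R for c2 == 5.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these types are illustrative and are not part of Pig.
final class CnfSketch {
    /** A leaf predicate (op == null) or a binary AND/OR node. */
    static final class Expr {
        final String op;    // "AND", "OR", or null for a leaf
        final String leaf;  // predicate text, only set for leaves
        final Expr left;
        final Expr right;

        Expr(String op, String leaf, Expr left, Expr right) {
            this.op = op;
            this.leaf = leaf;
            this.left = left;
            this.right = right;
        }
    }

    static Expr leaf(String text) { return new Expr(null, text, null, null); }
    static Expr and(Expr a, Expr b) { return new Expr("AND", null, a, b); }
    static Expr or(Expr a, Expr b) { return new Expr("OR", null, a, b); }

    /** Converts a NOT-free expression to CNF, bottom-up. */
    static Expr toCnf(Expr e) {
        if (e.op == null) {
            return e;
        }
        Expr l = toCnf(e.left);
        Expr r = toCnf(e.right);
        return e.op.equals("AND") ? and(l, r) : distribute(l, r);
    }

    /** (x AND y) OR r becomes (x OR r) AND (y OR r); the mirrored case is symmetric. */
    private static Expr distribute(Expr l, Expr r) {
        if ("AND".equals(l.op)) {
            return and(distribute(l.left, r), distribute(l.right, r));
        }
        if ("AND".equals(r.op)) {
            return and(distribute(l, r.left), distribute(l, r.right));
        }
        return or(l, r);
    }

    /** Lists the top-level conjuncts, each of which could be pushed independently. */
    static List<String> conjuncts(Expr e) {
        List<String> out = new ArrayList<>();
        collect(e, out);
        return out;
    }

    private static void collect(Expr e, List<String> out) {
        if ("AND".equals(e.op)) {
            collect(e.left, out);
            collect(e.right, out);
        } else {
            out.add(show(e));
        }
    }

    static String show(Expr e) {
        return e.op == null ? e.leaf : "(" + show(e.left) + " " + e.op + " " + show(e.right) + ")";
    }
}
```

For (P AND Q) OR R the conjuncts are (P OR R) and (Q OR R). The first references only columns of C, so an optimizer rule could consider it for pushing below the joins on its own. Be aware that repeated distribution can grow an expression exponentially in the worst case.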
[jira] Assigned: (PIG-1542) log level not propogated to MR task loggers
[ https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1542: --- Assignee: niraj rai This will be looked at after the branch since this is a regression and we don't have time to do it now. log level not propogated to MR task loggers --- Key: PIG-1542 URL: https://issues.apache.org/jira/browse/PIG-1542 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: niraj rai Fix For: 0.8.0 Specifying -d DEBUG does not affect the logging of the MR tasks . This was fixed earlier in PIG-882 . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1543: --- Assignee: Daniel Dai Daniel can you check if this is related to limit optimizer and if it was addressed with new optimizer. (This can be done post branch since it is a bug split.) IsEmpty returns the wrong value after using LIMIT - Key: PIG-1543 URL: https://issues.apache.org/jira/browse/PIG-1543 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Hu Assignee: Daniel Dai Fix For: 0.8.0
1. Two input files:
1a: limit_empty.input_a
1
1
1
1b: limit_empty.input_b
2
2
2. The pig script: limit_empty.pig
-- A contains only 1's B contains only 2's
A = load 'limit_empty.input_a' as (a1:int);
B = load 'limit_empty.input_a' as (b1:int);
C =COGROUP A by a1, B by b1;
D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B);
store D into 'limit_empty.output/d';
-- After the script done, we see the right results:
-- {(1),(1),(1)} {} 1 0 3 0
-- {} {(2),(2)} 0 1 0 2
C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim);
store D1 into 'limit_empty.output/d1';
-- After the script done, we see the unexpected results:
-- {(1)} {} 1 1 1 0
-- {} {(2)} 1 1 0 1
dump D;
dump D1;
3. Run the script and redirect the stdout (2 dumps) to a file.
There are two issues:
The major one: IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while IsEmpty() returns the correct value in limit_empty.output/d/*. The difference is that LIMIT has been applied to one of them before using IsEmpty().
The minor one: The redirected output only contains the first dump:
({(1),(1),(1)},{},1,0,3L,0L)
({},{(2),(2)},0,1,0L,2L)
We expect two more lines like:
({(1)},{},1,1,1L,0L)
({},{(2)},1,1,0L,1L)
Besides, there is an error that says: [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1567) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1567: --- Assignee: Xuefu Zhang Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly -- Key: PIG-1567 URL: https://issues.apache.org/jira/browse/PIG-1567 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0
The FilterAboveForeach rule optimizes the plan by pushing a filter above the preceding foreach operator. However, during code review, two major problems were found:
1. The current implementation assumes that if no projection is found in the filter condition, then all columns from the foreach are projected. This issue prevents the following optimization:
A = LOAD 'file.txt' AS (a(u,v), b, c);
B = FOREACH A GENERATE $0, b;
C = FILTER B BY 8 > 5;
STORE C INTO 'empty';
2. The current implementation doesn't handle the * projection, which means project all columns. As a result, it wasn't able to optimize the following:
A = LOAD 'file.txt' AS (a(u,v), b, c);
B = FOREACH A GENERATE $0, b;
C = FILTER B BY Identity.class.getName(*) > 5;
STORE C INTO 'empty';
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs
[ https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1570: --- Assignee: Thejas M Nair native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs - Key: PIG-1570 URL: https://issues.apache.org/jira/browse/PIG-1570 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 The code path for handling failure in MR job corresponding to native MR is different and does not have the same behavior. For example, even if the MR job for mapreduce operator fails, the number of jobs that failed is being reported as 0 in PigStats log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1572) change default datatype when relations are used as scalar to bytearray
[ https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1572: --- Assignee: Thejas M Nair change default datatype when relations are used as scalar to bytearray -- Key: PIG-1572 URL: https://issues.apache.org/jira/browse/PIG-1572 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 When relations are cast to scalar, the current default type is chararray. This is inconsistent with the behavior in rest of pig-latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903637#action_12903637 ] Olga Natkovich commented on PIG-1150: - So should we unlink this from the release? VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903640#action_12903640 ] Olga Natkovich commented on PIG-1563: - which JIRA is that? I will just get this in - I think that's all I have time today but I can look at the other one as well next week SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1150: Fix Version/s: 0.9.0 (was: 0.8.0) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.9.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-529) Want support for loading CSV files
[ https://issues.apache.org/jira/browse/PIG-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-529. Resolution: Duplicate This is duplicate of PIG-1555 which has been resolved for Pig 0.8 Want support for loading CSV files -- Key: PIG-529 URL: https://issues.apache.org/jira/browse/PIG-529 Project: Pig Issue Type: New Feature Components: data Reporter: Tom White Want to be able to load CSV data into Pig. This needs to handle quoting correctly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
[ https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-771. Fix Version/s: 0.7.0 Resolution: Fixed PigDump is no longer supported PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ?? -- Key: PIG-771 URL: https://issues.apache.org/jira/browse/PIG-771 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz Fix For: 0.7.0
PigDump does not properly output Chinese UTF8 characters. The reason for this is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString() and it calls Object.toString() on the opaque object d. I think that the code should be changed to call the new DataType.toString() function instead.
{code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if (d != null) {
            if (d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                // Keep the type suffixes for long and float values
                if (d instanceof Long) {
                    sb.append("L");
                } else if (d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            // Null fields are rendered as an empty string
            sb.append("");
        }
        if (it.hasNext()) sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1577) support to variable number of arguments in UDF
support to variable number of arguments in UDF -- Key: PIG-1577 URL: https://issues.apache.org/jira/browse/PIG-1577 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Olga Natkovich Fix For: 0.9.0 In the current implementation, the functionality that maps arguments to classes does not support functions with a variable number of arguments. It also does not support functions that accept a variable number of arguments drawn from a fixed set of counts. This causes problems for string UDFs such as CONCAT, which can take an arbitrary number of arguments, or TRIM, which can take 1, 2, or 3 arguments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1563: Attachment: PIG_1563_v2.patch SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903744#action_12903744 ] Olga Natkovich commented on PIG-1563: - Uploaded new patch which does the following: (1) Adds mapping function for functions with fixed number of arguments: SUBSTRING, LAST_INDEX_OF, REPLACE, TRIM (2) Left the rest of the functions alone, which means that until 0.9 they will only work on typed data. CONCAT is in the same category (3) Re-used applicable tests that Dmitry created, thanks! (4) Added a couple of e2e tests to make sure that we test the mapping function as well. Please review. We will keep the issue open till we address (2) in 0.9. SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven
[ https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1562: Fix Version/s: 0.8.0 Fix the version for the dependent packages for the maven - Key: PIG-1562 URL: https://issues.apache.org/jira/browse/PIG-1562 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Fix For: 0.8.0 We need to fix the set version so that the version is properly set for the dependent packages in the maven repository. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901975#action_12901975 ] Olga Natkovich commented on PIG-1560: - please, commit Build target 'checkstyle' fails --- Key: PIG-1560 URL: https://issues.apache.org/jira/browse/PIG-1560 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: pig-1560.patch Stack trace: {code} /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at 
org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901979#action_12901979 ] Olga Natkovich commented on PIG-1559: - Looks like the limit issue I was seeing has been addressed in the latest trunk. I think we need to add unit tests to catch these things in the future. Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1559.patch The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901984#action_12901984 ] Olga Natkovich commented on PIG-1559: - sorry, wrong JIRA Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1559.patch The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901985#action_12901985 ] Olga Natkovich commented on PIG-1557: - Looks like the limit issue I was seeing has been addressed in the latest trunk. I think we need to add unit tests to catch these things in the future. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1563) SUBSTRING function is broken
SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902211#action_12902211 ] Olga Natkovich commented on PIG-1563: - The same needs to be done (and we need unit tests) for the following string manipulation functions: INDEXOF LAST_INDEX_OF REPLACE SPLIT TRIM SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-908) Need a way to correlate MR jobs with Pig statements
[ https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-908: --- With Pig 0.8.0 we print a summary of the execution that contains (among other things) how aliases mapped to jobs. Example:
JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature            Outputs
job_201004271216_12712  1     1        3           3           3           12             12             12             B,C    GROUP_BY,COMBINER
job_201004271216_12713  1     1        3           3           3           12             12             12             D      SAMPLER
job_201004271216_12714  1     1        3           3           3           12             12             12             D      ORDER_BY,COMBINER  hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp743703298/tmp-2019944040,
Need a way to correlate MR jobs with Pig statements --- Key: PIG-908 URL: https://issues.apache.org/jira/browse/PIG-908 Project: Pig Issue Type: Wish Reporter: Dmitriy V. Ryaboy Assignee: Richard Ding Fix For: 0.8.0 Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1488) Make HDFS temp dir configurable
[ https://issues.apache.org/jira/browse/PIG-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1488: Release Note: Pig stores intermediate data generated between MR jobs in a temp location on HDFS. In Pig 0.8.0 this location is configurable by using pig.temp.dir property. The default is /tmp which is the same as hardcoded location in Pig 0.7.0 and earlier versions Make HDFS temp dir configurable --- Key: PIG-1488 URL: https://issues.apache.org/jira/browse/PIG-1488 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Fix For: 0.8.0 Currently it is hardcoded to /tmp. It should be made into a property. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
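The release note describes a property lookup with a fallback to /tmp. A minimal sketch of that pattern is shown below; the property name pig.temp.dir and the /tmp default come from the release note, while the class and method names are hypothetical and do not reflect where Pig actually reads the property.

```java
import java.util.Properties;

// Hypothetical sketch; TempDirSketch is illustrative and is not a Pig class.
final class TempDirSketch {
    static final String TEMP_DIR_KEY = "pig.temp.dir"; // property named in the release note
    static final String DEFAULT_TEMP_DIR = "/tmp";     // default stated in the release note

    /** Returns the configured temp location, or /tmp when the property is unset. */
    static String resolveTempDir(Properties props) {
        return props.getProperty(TEMP_DIR_KEY, DEFAULT_TEMP_DIR);
    }
}
```

With an empty Properties object the method falls back to /tmp, which matches the behavior of Pig 0.7.0 and earlier as described above.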
[jira] Updated: (PIG-1484) BinStorage should support comma seperated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1484: Release Note: In Pig 0.7.0 only a single location is supported as input to BinStorage. (This location can be a file, a directory or a glob). With Pig 0.8.0 we are making BinStorage (similar to PigStorage) support a list of locations. Example: a = load '1.bin,2.bin' using BinStorage(); BinStorage should support comma seperated path -- Key: PIG-1484 URL: https://issues.apache.org/jira/browse/PIG-1484 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0, 0.8.0 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch BinStorage does not take a comma-separated path. The following script fails: a = load '1.bin,2.bin' using BinStorage(); dump a; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
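Because a location can also be a glob, splitting a location list on every comma is not quite enough: Hadoop-style globs may contain commas inside braces, as in {a,b}. The sketch below shows one way to split only on top-level commas. It is an assumption about how such a list could be parsed, not the code in the attached patches, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; LocationListSketch is illustrative and is not a Pig class.
final class LocationListSketch {
    /** Splits on commas that are not nested inside a {…} glob alternative. */
    static List<String> splitLocations(String locations) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // current brace nesting level
        for (char c : locations.toCharArray()) {
            if (c == '{') {
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0) {
                out.add(current.toString()); // a top-level comma ends the current location
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        out.add(current.toString());
        return out;
    }
}
```

For the release note's example, '1.bin,2.bin' yields the two locations 1.bin and 2.bin, while a brace glob such as data/{a,b}.bin stays in one piece.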
[jira] Created: (PIG-1557) couple of issue mapping aliases to jobs
couple of issue mapping aliases to jobs
---

Key: PIG-1557
URL: https://issues.apache.org/jira/browse/PIG-1557
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding

I have a simple script:

A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, COUNT(A);
D = order C by $1;
E = limit D 10;
dump E;

I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output
[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901576#action_12901576 ]

Olga Natkovich commented on PIG-1447:
-

This is probably the smallest patch I have reviewed recently :). +1

Tune memory usage of InternalCachedBag
--

Key: PIG-1447
URL: https://issues.apache.org/jira/browse/PIG-1447
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch

We need to find a better value for pig.cachedbag.memusage.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901577#action_12901577 ]

Olga Natkovich commented on PIG-1354:
-

Dmitriy, could you add release notes on how to use this?

UDFs for dynamic invocation of simple Java methods
--

Key: PIG-1354
URL: https://issues.apache.org/jira/browse/PIG-1354
Project: Pig
Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.8.0
Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch

The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this.
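The reflection idea in the PIG-1354 description can be sketched in plain Java as follows. This is not the UDF from the attached patches and says nothing about its API: the class DynamicInvokeSketch and the method invokeStatic are hypothetical names, and the sketch only covers public static methods whose parameter types the caller spells out. It uses the standard java.lang.reflect calls Class.forName, Class.getMethod, and Method.invoke.

```java
import java.lang.reflect.Method;

// Hypothetical illustration of reflection-based invocation; this is not
// the generic UDF proposed in PIG-1354.
class DynamicInvokeSketch {
    // Invokes a public static method named as "fully.qualified.Class.method".
    // The caller supplies the parameter types so the right overload is found.
    static Object invokeStatic(String qualifiedName, Class<?>[] paramTypes, Object... args)
            throws Exception {
        int dot = qualifiedName.lastIndexOf('.');
        Class<?> cls = Class.forName(qualifiedName.substring(0, dot));
        Method method = cls.getMethod(qualifiedName.substring(dot + 1), paramTypes);
        // A null receiver is correct for static methods.
        return method.invoke(null, args);
    }

    public static void main(String[] args) throws Exception {
        // No wrapper class is written for Math.sqrt or Integer.parseInt;
        // both are looked up by name at run time.
        Object root = invokeStatic("java.lang.Math.sqrt", new Class<?>[] {double.class}, 16.0);
        Object parsed = invokeStatic("java.lang.Integer.parseInt", new Class<?>[] {String.class}, "42");
        System.out.println(root + " " + parsed);
    }
}
```

The trade-off is that method lookup and argument checks happen at run time, so a misspelled name or wrong type surfaces as an exception during execution instead of a compile error.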