[jira] Updated: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1272:
    Status: Patch Available  (was: Reopened)

Column pruner causes wrong results
----------------------------------
    Key: PIG-1272
    URL: https://issues.apache.org/jira/browse/PIG-1272
    Project: Pig
    Issue Type: Bug
    Components: impl
    Affects Versions: 0.6.0
    Reporter: Viraj Bhat
    Assignee: Daniel Dai
    Fix For: 0.7.0
    Attachments: PIG-1272-1.patch, PIG-1272-2.patch

For a simple script the column pruner optimization removes certain columns from the original relation, which results in wrong results.

Input file kv contains the following columns (tab separated)
{code}
a	1
a	2
a	3
b	4
c	5
c	6
b	7
d	8
{code}
Now running this script in Pig 0.6 produces
{code}
kv = load 'kv' as (k,v);
keys= foreach kv generate k;
keys = distinct keys;
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}
(a,a)
(a,a)
(a,a)
(b,b)
(b,b)

Running this in Pig 0.5 version without column pruner results in:
(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the ColumnPruner optimization it gives right results.

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
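The expected output reported in PIG-1272 can be cross-checked outside Pig. The following is a minimal plain-Python sketch that models the script's semantics (distinct, limit, join) on the same eight input rows. It is not Pig code and says nothing about how the column pruner is implemented. The variable names mirror the script's aliases, `pruned` is a hypothetical name for the faulty result, and sorting the distinct keys is an assumption made only so that the limit keeps `a` and `b` as in the report (Pig's LIMIT does not guarantee which rows are kept).

```python
# Plain-Python model of the Pig script from PIG-1272 (illustration only).
# Rows of the tab-separated input file 'kv' as (k, v) tuples.
kv = [("a", 1), ("a", 2), ("a", 3), ("b", 4),
      ("c", 5), ("c", 6), ("b", 7), ("d", 8)]

# keys = foreach kv generate k;  keys = distinct keys;
# Assumption: sort the distinct keys so the outcome is deterministic.
keys = sorted({k for k, _ in kv})

# keys = limit keys 2;
keys = keys[:2]

# rejoin = join keys by k, kv by k;
# Each joined tuple carries the key from 'keys' plus both columns of 'kv'.
rejoin = [(key, k, v) for key in keys for k, v in kv if key == k]

# Shape of the faulty result in the report: column v is missing from 'kv'.
pruned = [(key, k) for key, k, _ in rejoin]

for row in rejoin:
    print(row)
```

Here `rejoin` holds the three-column tuples that Pig 0.5 returns, and `pruned` holds the two-column tuples that Pig 0.6 returns when `v` is dropped before the join.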
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845083#action_12845083 ]

Hadoop QA commented on PIG-1178:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12438738/pig_1178_3.3.patch
  against trunk revision 922664.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 28 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/251/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/251/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/251/console

This message is automatically generated.

LogicalPlan and Optimizer are too complex and hard to work with
---------------------------------------------------------------
    Key: PIG-1178
    URL: https://issues.apache.org/jira/browse/PIG-1178
    Project: Pig
    Issue Type: Improvement
    Reporter: Alan Gates
    Assignee: Daniel Dai
    Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.patch

The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer.

See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details.
[jira] Commented: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845162#action_12845162 ]

Hadoop QA commented on PIG-1272:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12438743/PIG-1272-2.patch
  against trunk revision 922664.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 6 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/252/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/252/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/252/console

This message is automatically generated.

Column pruner causes wrong results
----------------------------------
    Key: PIG-1272
    URL: https://issues.apache.org/jira/browse/PIG-1272
[jira] Updated: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1272:
    Resolution: Fixed
    Status: Resolved  (was: Patch Available)

Manual unit tests pass.

Column pruner causes wrong results
----------------------------------
    Key: PIG-1272
    URL: https://issues.apache.org/jira/browse/PIG-1272
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845173#action_12845173 ]

Daniel Dai commented on PIG-1178:
---------------------------------

pig_1178_3.3.patch committed. Manual unit tests pass.

LogicalPlan and Optimizer are too complex and hard to work with
---------------------------------------------------------------
    Key: PIG-1178
    URL: https://issues.apache.org/jira/browse/PIG-1178
[jira] Commented: (PIG-200) Pig Performance Benchmarks
[ https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845174#action_12845174 ]

duncan commented on PIG-200:
----------------------------

Hi Daniel,

How can I run perf.patch? It contains a lot of different things. I want to generate the data set and use those 14 Pig queries for benchmarking. Would you mind telling me more about how to use perf.patch?

Thanks
Duncan

Pig Performance Benchmarks
--------------------------
    Key: PIG-200
    URL: https://issues.apache.org/jira/browse/PIG-200
    Project: Pig
    Issue Type: Task
    Reporter: Amir Youssefi
    Assignee: Alan Gates
    Attachments: generate_data.pl, perf.hadoop.patch, perf.patch

To benchmark Pig performance, we need a TPC-H-like large data set plus a script collection. These would be used to compare different Pig releases and to compare Pig against other systems (e.g. Pig + Hadoop vs. Hadoop only).

Here is the wiki page for small tests: http://wiki.apache.org/pig/PigPerformance

I am currently running long-running Pig scripts over data sets on the order of tens of TBs. The next step is hundreds of TBs.

We need an open large data set (open source scripts that generate the data set) and detailed scripts for important operations such as ORDER, AGGREGATION, etc.

We can call those the Pig Workouts: Cardio (short processing), Marathon (long-running scripts) and Triathlon (mix).

I will update this JIRA with more details of current activities soon.