[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Release Note: Feature: combine splits of sizes smaller than the value of property pig.maxCombinedSplitSize or, if the property of pig.maxCombinedSplitSize is not set, the file system default block size of the load's location. This feature can be turned off through setting the property pig.splitCombination to false. When such a combination is performed, a log message like Total input paths (combined) to process : 7 will be logged. This feature will be applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more under-fed mappers to be launched and potentially slowdown of the execution. This change will not cause any backward compatibility issue except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method where a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object. This change also requires the loader to be stateless across the invocations to the prepareToRead method. That is, the method should reset any internal states that are not affected by the RecordReader argument. Otherwise, this feature should be disabled. In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations. was: Feature: combine splits of sizes smaller than the value of property pig.maxCombinedSplitSize or, if the property of pig.maxCombinedSplitSize is not set, the file system default block size of the load's location. This feature can be turned off through setting the property pig.noSplitCombination to true. When such a combination is performed, a log message like Total input paths (combined) to process : 7 will be logged. This feature will be applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more under-fed mappers to be launched and potentially slowdown of the execution. This change will not cause any backward compatibility issue except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method where a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object. In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run in the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file which could be very inefficient. It would be greate to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing similar thing: MultifileInputFormat as well as CombinedInputFormat; howevere, neither works with ne Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1642: - Assignee: Richard Ding Order by doesn't use estimation to determine the parallelism Key: PIG-1642 URL: https://issues.apache.org/jira/browse/PIG-1642 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 With PIG-1249, a simple heuristic is used to determine the number of reducers if it isn't specified (via PARALLEL or default_parallel). For order by statement, however, it still defaults to 1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1641: -- Fix Version/s: 0.8.0 Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.8.0 User report, not verified. email HadoopVersionPigVersionUserIdStartedAtFinishedAtFeatures 0.20.20.8.0-SNAPSHOTuser2010-09-21 19:25:582010-09-21 21:58:42ORDER_BY Success! Job Stats (time in seconds): JobIdMapsReducesMaxMapTimeMinMapTImeAvgMapTime MaxReduceTimeMinReduceTimeAvgReduceTimeAliasFeatureOutputs job_local_000100000000rawMAP_ONLY job_local_000200000000rank_sort SAMPLER job_local_000300000000rank_sort ORDER_BYProcessed/user_visits_table, Input(s): Successfully read 0 records from: Data/Raw/UserVisits.dat Output(s): Successfully stored 0 records in: Processed/user_visits_table However, when I look in the output: $ ls -lh Processed/user_visits_table/CG0/ total 15250760 -rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0* It read a 20G input file and generated some output... /email Is it that in local mode counters are not available? If so, instead of printing zeros we should print Information Unavailable or some such. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1641: -- Status: Patch Available (was: Open) Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1641.patch User report, not verified. email HadoopVersionPigVersionUserIdStartedAtFinishedAtFeatures 0.20.20.8.0-SNAPSHOTuser2010-09-21 19:25:582010-09-21 21:58:42ORDER_BY Success! Job Stats (time in seconds): JobIdMapsReducesMaxMapTimeMinMapTImeAvgMapTime MaxReduceTimeMinReduceTimeAvgReduceTimeAliasFeatureOutputs job_local_000100000000rawMAP_ONLY job_local_000200000000rank_sort SAMPLER job_local_000300000000rank_sort ORDER_BYProcessed/user_visits_table, Input(s): Successfully read 0 records from: Data/Raw/UserVisits.dat Output(s): Successfully stored 0 records in: Processed/user_visits_table However, when I look in the output: $ ls -lh Processed/user_visits_table/CG0/ total 15250760 -rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0* It read a 20G input file and generated some output... /email Is it that in local mode counters are not available? If so, instead of printing zeros we should print Information Unavailable or some such. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1641: -- Attachment: PIG-1641.patch Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1641.patch User report, not verified. email HadoopVersionPigVersionUserIdStartedAtFinishedAtFeatures 0.20.20.8.0-SNAPSHOTuser2010-09-21 19:25:582010-09-21 21:58:42ORDER_BY Success! Job Stats (time in seconds): JobIdMapsReducesMaxMapTimeMinMapTImeAvgMapTime MaxReduceTimeMinReduceTimeAvgReduceTimeAliasFeatureOutputs job_local_000100000000rawMAP_ONLY job_local_000200000000rank_sort SAMPLER job_local_000300000000rank_sort ORDER_BYProcessed/user_visits_table, Input(s): Successfully read 0 records from: Data/Raw/UserVisits.dat Output(s): Successfully stored 0 records in: Processed/user_visits_table However, when I look in the output: $ ls -lh Processed/user_visits_table/CG0/ total 15250760 -rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0* It read a 20G input file and generated some output... /email Is it that in local mode counters are not available? If so, instead of printing zeros we should print Information Unavailable or some such. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1643: --- Attachment: PIG-1643.1.patch PIG-1643.1.patch There was a code path that lead to fields having NULL datatype instead of the default datatype of BYTEARRAY. That was causing these failures. Test-patch has succeeded, unit tests are running. join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch {code} l1 = load 'std.txt'; l2 = load 'std.txt'; f1 = foreach l1 generate $0 as abc, $1 as def; -- j = join f1 by $0, l2 by $0 using 'replicated'; -- j = join l2 by $0, f1 by $0 using 'replicated'; j = join l2 by $0, f1 by $0 ; dump j; {code} the error - {code} 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type {code} The MR plan from explain - {code} #-- # Map Reduce Plan #-- MapReduce node scope-21 Map Plan Union[tuple] - scope-22 | |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 | | | | | Project[bytearray][0] - scope-12 | | | |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0 | |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 | | | Project[NULL][0] - scope-14 | |---f1: New For Each(false,false)[bag] - scope-6 | | | Project[bytearray][0] - scope-2 | | | Project[bytearray][1] - scope-4 | |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1 Reduce Plan j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 | |---POJoinPackage(true,true)[tuple] - scope-23 Global sort: false {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1643: --- Status: Patch Available (was: Open) join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch {code} l1 = load 'std.txt'; l2 = load 'std.txt'; f1 = foreach l1 generate $0 as abc, $1 as def; -- j = join f1 by $0, l2 by $0 using 'replicated'; -- j = join l2 by $0, f1 by $0 using 'replicated'; j = join l2 by $0, f1 by $0 ; dump j; {code} the error - {code} 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type {code} The MR plan from explain - {code} #-- # Map Reduce Plan #-- MapReduce node scope-21 Map Plan Union[tuple] - scope-22 | |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 | | | | | Project[bytearray][0] - scope-12 | | | |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0 | |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 | | | Project[NULL][0] - scope-14 | |---f1: New For Each(false,false)[bag] - scope-6 | | | Project[bytearray][0] - scope-2 | | | Project[bytearray][1] - scope-4 | |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1 Reduce Plan j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 | |---POJoinPackage(true,true)[tuple] - scope-23 Global sort: false {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914126#action_12914126 ] Daniel Dai commented on PIG-1643: - +1 if tests pass. join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch {code} l1 = load 'std.txt'; l2 = load 'std.txt'; f1 = foreach l1 generate $0 as abc, $1 as def; -- j = join f1 by $0, l2 by $0 using 'replicated'; -- j = join l2 by $0, f1 by $0 using 'replicated'; j = join l2 by $0, f1 by $0 ; dump j; {code} the error - {code} 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type {code} The MR plan from explain - {code} #-- # Map Reduce Plan #-- MapReduce node scope-21 Map Plan Union[tuple] - scope-22 | |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 | | | | | Project[bytearray][0] - scope-12 | | | |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0 | |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 | | | Project[NULL][0] - scope-14 | |---f1: New For Each(false,false)[bag] - scope-6 | | | Project[bytearray][0] - scope-2 | | | Project[bytearray][1] - scope-4 | |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1 Reduce Plan j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 | |---POJoinPackage(true,true)[tuple] - scope-23 Global sort: false {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
[ https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914128#action_12914128 ] Yan Zhou commented on PIG-1645: --- The problem is that both RandomSampleLoader and PossionSampleLoader have internal states from the previous invocations that should be reset when a different underlying split is worked on under the same umbrella split when the split combination (PIG-1518) is on. When temporary file compression is disabled, Pig internal storage will create empty files which will be discarded by split combiner, making the only non-empty split as the only split to be worked on, so it is ok in this case. Using both small split combination and temporary file compression on a query of ORDER BY may cause crash Key: PIG-1645 URL: https://issues.apache.org/jira/browse/PIG-1645 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.8.0 The stack looks like the following: java.lang.NullPointerException at java.util.Arrays.binarySearch(Arrays.java:2043) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914130#action_12914130 ] Thejas M Nair commented on PIG-1644: These operations will be fairly common in the optimizer. I think it would be good to have functions in the OperatorPlan that support these operations, that will reduce the chances of bugs and also make the code more readable. New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1644-1.patch When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in origination and destination, and use this position when connect to the new predecessor/successor. Some of the pattens are: Insert a new node: {code} PairInteger, Integer pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToRemove); PairInteger, Integer pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToReplace); PairInteger, Integer pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are couple of places of we does not follow this pattern, that results some error. For example, the following script fail: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914145#action_12914145 ] Yan Zhou commented on PIG-1635: --- test-patch results: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed Key: PIG-1635 URL: https://issues.apache.org/jira/browse/PIG-1635 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Minor Fix For: 0.8.0 Attachments: PIG-1635.patch b = FILTER a by (( f1 1) AND (1 == 1)) or b = FILTER a by ((f1 1) OR ( 1==0)) should be simplified to b = FILTER a by f1 1; Regarding ordering change, an example is that b = filter a by ((f1 is not null) AND (f2 is not null)); Even without possible simplification, the expression is changed to b = filter a by ((f2 is not null) AND (f1 is not null)); Even though the ordering change in this case, and probably in most other cases, does not create any difference, but for two reasons some users might care about the ordering: if stateful UDFs are used as operands of AND or OR; and if the ordering is intended by the application designer to maximize the chances to shortcut the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914147#action_12914147 ] Daniel Dai commented on PIG-1644: - Yes, I think we can do replace/remove/insert. They should be simple and clear enough to use. Here is the new methods adding to OperatorPlan: {code} replace(Operator oldOperator, Operator newOperator) remove(Operator operatorToRemove) // Connect all its successors to predecessor/connect all it's predecessors to successor insertBefore(Operator operatorToInsert, Operator pos) // Insert operatorToInsert before pos, connect all pos's predecessors to operatorToInsert insertAfter(Operator operatorToInsert, Operator pos) // Insert operatorToInsert after pos, connect operatorToInsert to all pos's successor {code} How does it sounds? New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1644-1.patch When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in origination and destination, and use this position when connect to the new predecessor/successor. Some of the pattens are: Insert a new node: {code} PairInteger, Integer pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToRemove); PairInteger, Integer pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToReplace); PairInteger, Integer pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are couple of places of we does not follow this pattern, that results some error. For example, the following script fail: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914150#action_12914150 ] Yan Zhou commented on PIG-1635: --- All test-core tests also run clean. Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed Key: PIG-1635 URL: https://issues.apache.org/jira/browse/PIG-1635 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Minor Fix For: 0.8.0 Attachments: PIG-1635.patch b = FILTER a by (( f1 1) AND (1 == 1)) or b = FILTER a by ((f1 1) OR ( 1==0)) should be simplified to b = FILTER a by f1 1; Regarding ordering change, an example is that b = filter a by ((f1 is not null) AND (f2 is not null)); Even without possible simplification, the expression is changed to b = filter a by ((f2 is not null) AND (f1 is not null)); Even though the ordering change in this case, and probably in most other cases, does not create any difference, but for two reasons some users might care about the ordering: if stateful UDFs are used as operands of AND or OR; and if the ordering is intended by the application designer to maximize the chances to shortcut the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Summary: New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF (was: New logical plan: PushUpFilter should not optimize if filter condition contains UDF) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF Key: PIG-1639 URL: https://issues.apache.org/jira/browse/PIG-1639 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1639-1.patch The following script fail: {code} a = load 'file' AS (f1, f2, f3); b = group a by f1; c = filter b by COUNT(a) 1; dump c; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914154#action_12914154 ] Daniel Dai commented on PIG-1639: - +1 if all tests pass. New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF Key: PIG-1639 URL: https://issues.apache.org/jira/browse/PIG-1639 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1639-1.patch The following script fail: {code} a = load 'file' AS (f1, f2, f3); b = group a by f1; c = filter b by COUNT(a) 1; dump c; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914167#action_12914167 ] Thejas M Nair commented on PIG-1644: I think insertAsPredecessor and insertAsSuccessor (instead of insertBefore and insertAfter) will convey the idea of what it does a little better. New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1644-1.patch When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in origination and destination, and use this position when connect to the new predecessor/successor. Some of the pattens are: Insert a new node: {code} PairInteger, Integer pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToRemove); PairInteger, Integer pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToReplace); PairInteger, Integer pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are couple of places of we does not follow this pattern, that results some error. For example, the following script fail: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1632: Status: Resolved (was: Patch Available) Resolution: Fixed The core jar in the tarball contains the kitchen sink -- Key: PIG-1632 URL: https://issues.apache.org/jira/browse/PIG-1632 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0, 0.9.0 Reporter: Eli Collins Assignee: Eli Collins Fix For: site, 0.9.0 Attachments: pig-1632-1.patch, pig-1632-2.patch The core jar in the tarball contains the kitchen sink, it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath. {noformat} pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l 12 pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz ... pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l 4819 {noformat} How about restricting the core jar to just Pig classes? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1643: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Tests passed. Patch committed to 0.8 branch and trunk. join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch {code} l1 = load 'std.txt'; l2 = load 'std.txt'; f1 = foreach l1 generate $0 as abc, $1 as def; -- j = join f1 by $0, l2 by $0 using 'replicated'; -- j = join l2 by $0, f1 by $0 using 'replicated'; j = join l2 by $0, f1 by $0 ; dump j; {code} the error - {code} 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type {code} The MR plan from explain - {code} #-- # Map Reduce Plan #-- MapReduce node scope-21 Map Plan Union[tuple] - scope-22 | |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 | | | | | Project[bytearray][0] - scope-12 | | | |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0 | |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 | | | Project[NULL][0] - scope-14 | |---f1: New For Each(false,false)[bag] - scope-6 | | | Project[bytearray][0] - scope-2 | | | Project[bytearray][1] - scope-4 | |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1 Reduce Plan j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 | |---POJoinPackage(true,true)[tuple] - scope-23 Global sort: false {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914317#action_12914317 ] Daniel Dai commented on PIG-1644: - After looking into the existing code, seems insertBetween is a more useful method. So I want to drop insertBefore/insertAfter, and add insertBetween {code} insertBetween(Operator pred, Operator operatorToInsert, Operator succ) {code} New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1644-1.patch When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in origination and destination, and use this position when connect to the new predecessor/successor. Some of the pattens are: Insert a new node: {code} PairInteger, Integer pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToRemove); PairInteger, Integer pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToReplace); PairInteger, Integer pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are couple of places of we does not follow this pattern, that results some error. For example, the following script fail: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-2.patch Attach the patch with new methods and refactory of existing code. New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1644-1.patch, PIG-1644-2.patch When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in origination and destination, and use this position when connect to the new predecessor/successor. Some of the pattens are: Insert a new node: {code} PairInteger, Integer pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToRemove); PairInteger, Integer pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} PairInteger, Integer pos1 = plan.disconnect(pred, nodeToReplace); PairInteger, Integer pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are couple of places of we does not follow this pattern, that results some error. For example, the following script fail: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.