[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887283#action_12887283 ] Ashutosh Chauhan commented on PIG-1249: --- The map-reduce framework has a jira related to this issue: https://issues.apache.org/jira/browse/MAPREDUCE-1521 It has two implications for Pig: 1) We need to reconsider whether we still want Pig to set the number of reducers on the user's behalf. We could stop guessing the number of reducers and let the framework fail any job that doesn't specify it correctly. Then Pig is out of the guessing game, and users are forced by the framework to specify the number of reducers correctly. 2) Now that the MR framework will fail jobs based on configured limits, operators where Pig does compute and set the number of reducers (like skewed join) should be aware of those limits, so that the number of reducers they compute falls within them.

Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of the PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with a badly mis-configured number of reduces, e.g. 1 reduce. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
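The clamping behavior described in point 2) above can be sketched as follows (illustrative Python only; the function and parameter names are not Pig's actual API):

```python
# Sketch: when Pig estimates a reducer count on the user's behalf
# (e.g. for skewed join), the estimate should be clamped to the
# framework's configured limit so the job is not rejected by the
# MAPREDUCE-1521 safeguard. Names here are hypothetical.

def clamp_reducers(estimated, framework_limit):
    """Keep an estimated reducer count within [1, framework_limit]."""
    return max(1, min(estimated, framework_limit))
```

For example, an estimate of 5000 reducers against a configured limit of 999 would be clamped down to 999, while an estimate of 0 would be raised to 1.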
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-928: --- Status: Patch Available (was: Open) UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-928: --- Attachment: (was: RegisterPythonUDF2.patch) UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1490) Make Pig storers work with remote HDFS in secure mode
[ https://issues.apache.org/jira/browse/PIG-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887424#action_12887424 ] Daniel Dai commented on PIG-1490: - +1 Make Pig storers work with remote HDFS in secure mode - Key: PIG-1490 URL: https://issues.apache.org/jira/browse/PIG-1490 Project: Pig Issue Type: Bug Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0, 0.8.0 Attachments: PIG-1490.patch PIG-1403 fixed the problem for Pig loaders. We need to do the same for Pig storers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887425#action_12887425 ] Swati Jain commented on PIG-1494: - Reply from Yan Zhou: The filter logic split problem can be divided into 2 parts: 1) the filtering logic that can be applied to individual input sources; and 2) the filtering logic that has to be applied when the merged (or joined) inputs are processed. The benefits for 1) are whatever the underlying storage gains if it supports predicate pushdown, plus the memory/CPU Pig saves by not processing unqualified rows. For 2), the purpose is not paying higher evaluation costs than necessary. For 1), no normal form is necessary. The original logical expression tree can be trimmed of any sub-expressions that are not constants and not solely from a particular input source. The complexity is linear in the tree size, while the use of a normal form could potentially lead to exponential complexity. The difficulty with this approach is how to generate the filtering logic for 2); CNF can be used to easily figure out the logic for 2). However, the exact logic in 2) might not be cheaper to evaluate than the original logical expression. An example is Filter J2 by ((C1 > 10) AND (a3+b3 > 10)) OR ((C2 == 5) AND (a2+b2 > 5)). In 2) the filtering logic after CNF will be ((C1 > 10) OR (a2+b2 > 5)) AND ((a3+b3 > 10) OR (C2 == 5)) AND ((a3+b3 > 10) OR (a2+b2 > 5)). The cost will be 5 logical evaluations (3 ORs plus 2 ANDs), which could be reduced to 4, compared with 3 logical evaluations in the original form. In summary, if only 1) is desired, tree trimming is enough. If 2) is desired too, then CNF could be used, but its complexity should be controlled, and the cost of evaluating the filtering logic in 2) should be computed and compared with the cost of evaluating the original expression. Further optimization is possible in this direction.
Another potential optimization to consider is to support a logical expression tree with multiple children, instead of the binary tree, after taking into consideration the commutative property of the OR and AND operations. The advantages are lower tree-traversal costs, and it becomes easier to change the evaluation ordering within the same sub-tree in order to maximize the possibilities to short-cut the evaluations. Although this is general for all logical expressions, it tends to be more suitable for normal-form handling, as normal forms group the sub-expressions by the operators that act on them.

PIG Logical Optimization: Use CNF in PushUpFilter - Key: PIG-1494 URL: https://issues.apache.org/jira/browse/PIG-1494 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Swati Jain Priority: Minor Fix For: 0.8.0 The PushUpFilter rule is not able to handle complicated boolean expressions. For example, the SplitFilter rule splits one LOFilter into two by AND; however, it is not able to split an LOFilter if the top-level operator is OR. For example: *ex script:* A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); J1 = JOIN B by b1, C by c1; J2 = JOIN J1 by $0, A by a1; D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);* explain D; In the above example, PushUpFilter is not able to push any filter condition across any join, as the condition contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF), then we would be able to push the filter conditions c1 > 10 and c2 == 5 below both join conditions.
Here is the CNF expression for the highlighted line: ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) ) *Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF; it would then be easy to selectively push parts (conjuncts) of the LOFilter boolean expression. We would also no longer require the SplitFilter rule if we added this utility to the PushUpFilter rule itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
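The CNF conversion proposed above can be sketched by repeatedly distributing OR over AND (illustrative Python, not Pig's LOFilter code; expressions are modeled as nested tuples, which is an assumption of this sketch):

```python
# Sketch of CNF conversion by distribution. An expression is either a
# leaf (string) or a tuple ('and', l, r) / ('or', l, r). Hypothetical
# representation for illustration only.

def to_cnf(e):
    if not isinstance(e, tuple):
        return e                                  # leaf predicate, e.g. 'c2==5'
    op, l, r = e
    l, r = to_cnf(l), to_cnf(r)
    if op == 'and':
        return ('and', l, r)
    # op == 'or': distribute OR over any AND child
    if isinstance(l, tuple) and l[0] == 'and':
        return ('and', to_cnf(('or', l[1], r)), to_cnf(('or', l[2], r)))
    if isinstance(r, tuple) and r[0] == 'and':
        return ('and', to_cnf(('or', l, r[1])), to_cnf(('or', l, r[2])))
    return ('or', l, r)

# The example expression: ((c1 > 10) AND (a3+b3 > 10)) OR (c2 == 5)
e = ('or', ('and', 'c1>10', 'a3+b3>10'), 'c2==5')
```

Running `to_cnf(e)` yields the conjunction of `(c1>10 OR c2==5)` and `(a3+b3>10 OR c2==5)`, matching the CNF given in the issue description; each conjunct can then be considered for push-down independently. Note the distribution step is what can cause the exponential blow-up mentioned in the comment, hence the suggestion to control complexity.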
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887420#action_12887420 ] Daniel Dai commented on PIG-1295: - More clarification for custom tuples. There are two cases for a custom tuple: 1. The user creates a custom tuple inside a UDF. In this case, we do not have a special serialized format for the custom tuple; after serialization, we cannot tell whether it is a custom tuple. That is to say, we lose track of the tuple implementation after se/des. Since the serialized format is the same, we can still use the same raw comparator. 2. If the user uses a custom tuple factory (by overriding pig.data.tuple.factory.name), then the serialized format may change. If we detect that the tuple factory is not BinSedesTupleFactory, we shall not use this raw comparator.

Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch When the hadoop framework does the sorting, it will try to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before we compare them. We saw a ~30% speedup after switching to a binary comparator. Currently, Pig uses a binary comparator in the following cases: 1. When the semantics of the order don't matter. For example, in distinct, we need to sort in order to filter out duplicate values; however, we do not care how the comparator sorts keys. Group-by also shares this character. In this case, we rely on hadoop's default binary comparator. 2. When the semantics of the order matter, but the key is of a simple type.
In this case, we have implementations for simple types such as integer, long, float, chararray, databytearray, and string. However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown. A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group-by followed by a nested sort. In this case, we use secondary sort; the semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but as a first step we focus on a simple secondary key, which is the most common use case. We mark this issue as a candidate project for the Google Summer of Code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
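The core idea behind a raw (binary) comparator can be sketched as follows (illustrative Python; this is not Pig's BinSedes wire format, only a demonstration of the principle that a suitable serialization makes byte-wise comparison agree with value comparison):

```python
import struct

# Sketch: if an integer key is serialized big-endian with its sign bit
# flipped, plain lexicographic comparison of the serialized bytes gives
# the same ordering as comparing the deserialized values, so no object
# needs to be instantiated during the sort. Pig's actual serialized
# format differs; this only illustrates why raw comparison is possible.

def serialize_key(n):
    """Serialize a 32-bit signed int so that byte order == numeric order."""
    return struct.pack('>I', (n + 2**31) & 0xFFFFFFFF)

def raw_compare(a, b):
    """Compare two serialized keys without deserializing them."""
    return (a > b) - (a < b)   # bytes compare lexicographically
```

For a composite (main key, secondary key) buffer, the same approach applies once the boundary between the two keys can be located in the bytes: compare the main-key bytes first, and only on equality fall through to the secondary-key bytes.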
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-928: --- Status: Open (was: Patch Available) UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
PIG Logical Optimization: Use CNF in PushUpFilter - Key: PIG-1494 URL: https://issues.apache.org/jira/browse/PIG-1494 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Swati Jain Priority: Minor Fix For: 0.8.0 The PushUpFilter rule is not able to handle complicated boolean expressions. For example, the SplitFilter rule splits one LOFilter into two by AND; however, it is not able to split an LOFilter if the top-level operator is OR. For example: *ex script:* A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); J1 = JOIN B by b1, C by c1; J2 = JOIN J1 by $0, A by a1; D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);* explain D; In the above example, PushUpFilter is not able to push any filter condition across any join, as the condition contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF), then we would be able to push the filter conditions c1 > 10 and c2 == 5 below both join conditions. Here is the CNF expression for the highlighted line: ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) ) *Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF; it would then be easy to selectively push parts (conjuncts) of the LOFilter boolean expression. We would also no longer require the SplitFilter rule if we added this utility to the PushUpFilter rule itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1490) Make Pig storers work with remote HDFS in secure mode
[ https://issues.apache.org/jira/browse/PIG-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1490: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Release Note: Committed to both trunk and 0.7 branch Resolution: Fixed Make Pig storers work with remote HDFS in secure mode - Key: PIG-1490 URL: https://issues.apache.org/jira/browse/PIG-1490 Project: Pig Issue Type: Bug Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0, 0.7.0 Attachments: PIG-1490.patch PIG-1403 fixed the problem for Pig loaders. We need to do the same for Pig storers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
RE: PIG Logical Optimization: Use CNF in SplitFilter
Yes, I already implemented the NOT push-down upfront, so you do not need to do that. The support of CNF will probably be the most difficult part. But as I mentioned last time, you should compare the costs after trimming the CNF to get the post-split filtering logic. Given the complexity of manipulating CNF and the undetermined benefits, I am not sure whether it should be in scope at this moment. To handle CNF, I think it's a good idea to create a new plan and connect the nodes in the new plan to the base plan, as you envisioned. In my changes, which use DNF instead of CNF but are otherwise similar in processing, I use a LogicalExpressionProxy, which contains a source member that is just the node in the original plan, to link the nodes in the new plan and the old plan. The original LogicalExpression is enhanced with a counter to trace the number of proxies of the original nodes, since normal-form creation will spread the nodes of the original tree across many normalized nodes. The benefit, aside from not setting the plan, is that the original expression is trimmed according to the processing results from the DNF, while the DNF is created separately, as a kind of utility, so that complex features can be used. In my changes, I used a multiple-child tree in the DNF while not changing the original binary expression tree structure. Another benefit is that the original tree is kept as much as it was at the start, i.e., I do not attempt to optimize its overall structure beyond trimming based upon the simplification logic. (I also limit the size of the DNF to 100 nodes.) The downside of this is added complexity. But in your case, for scenario 2, which is the whole point of using CNF, you would need to change the original expression tree structurally, beyond trimming, to get the post-split filtering logic. The other benefit of using multiple-child expressions depends on whether you plan to support such expressions as a replacement for the current binary tree in the final plan.
Even though I think it's a good idea to support that, it is not in my scope now. I'll add my algorithm details to my jira soon. Please take a look and comment as you see appropriate. Thanks, Yan From: Swati Jain [mailto:swat...@aggiemail.usu.edu] Sent: Friday, July 09, 2010 11:00 PM To: Yan Zhou Cc: pig-dev@hadoop.apache.org Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter Hi Yan, I agree that the first scenario (filter logic applied to individual input sources) doesn't need conversion to CNF, and that it would be a good idea to add CNF functionality for the second scenario. I was also planning to provide a configurable threshold value to control the complexity of the CNF conversion. As part of the above, I wrote a utility to push the NOT operator in predicates below the AND and OR operators (Scenario 2 in PIG-1399). I am considering making this utility to push NOT a separate rule in itself. Lmk if you have already implemented this. While implementing this utility, I am facing some trouble keeping the OperatorPlan consistent as I rewrite the expression. This is because each operator references the main filter logical plan. Here is my current implementation approach: 1. I am creating a new LogicalExpressionPlan for the converted boolean expression. 2. I am creating new logical expressions while pushing the NOT operation, converting AND into OR, OR into AND, and eliminating NOT NOT pairs. 3. However, I am having trouble updating the LogicalExpressionPlan when it reaches the base case (i.e. the root operator is not NOT, AND, or OR). D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 > 10) AND (c3+b3 > 10) ) ) ); In the above, for example, I am not sure how to integrate the base expression (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the plan for a given operator and its children. Also, there is currently no way to deepCopy an expression into a new OperatorPlan. It would be great if you could give me some suggestions on what approach to take for this.
One approach I thought of is to visit the base expression and create and connect the base expression to the LogicalExpressionPlan as I visit it. Thoughts? Swati ps: About your other point regarding binary vs multi-way trees: the way I am creating the normal form is as a list of conjuncts, where each conjunct is a list of disjuncts. This is logically similar to a multi-way tree. However, the current modeling of boolean expressions (modeled as binary expressions) requires a conversion back to the binary tree model when adding back to the main plan. On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou y...@yahoo-inc.com wrote: Swati, I happen to be working on the logical expression simplification effort (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter split front. So I guess our interests will have some overlaps. I think the filter logic split problem can be divided into 2 parts: 1) the filtering logic that can be applied to individual input sources;
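The NOT push-down utility discussed in this thread can be sketched with De Morgan's laws and double-negation elimination (illustrative Python, not Pig's visitor code; the tuple-based expression representation is an assumption of this sketch):

```python
# Sketch of pushing NOT below AND/OR. An expression is a leaf (string),
# ('not', e), ('and', l, r), or ('or', l, r). The negate flag carries a
# pending NOT downward; two pending NOTs cancel (NOT NOT elimination),
# and a NOT crossing AND/OR flips the operator (De Morgan).

def push_not(e, negate=False):
    if not isinstance(e, tuple):                  # base expression, e.g. 'c2 == 5'
        return ('not', e) if negate else e
    op = e[0]
    if op == 'not':
        return push_not(e[1], not negate)         # flip the pending negation
    l, r = push_not(e[1], negate), push_not(e[2], negate)
    if negate:
        op = 'or' if op == 'and' else 'and'       # De Morgan's law
    return (op, l, r)
```

Applied to the NOT sub-expression of the example filter, `NOT((c1 > 10) AND (c3+b3 > 10))` becomes `(NOT(c1 > 10)) OR (NOT(c3+b3 > 10))`, after which NOT appears only directly above base expressions and the usual AND/OR normalization rules can proceed.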
[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887441#action_12887441 ] Thejas M Nair commented on PIG-1472: bq. 1. The following code is never used in BinStorage and InterStorage, and should be removed. I will remove that. bq. 3. It seems InterStorage is a replacement for BinStorage; why do we make it private? Shall we encourage users to use InterStorage in place of BinStorage, and deprecate BinStorage? In future, we are likely to find better ways to serialize data between the MR jobs of a pig query, i.e. the InterSedes serialization format is likely to change, and the change is not likely to be compatible with its old format. So it will not be suitable for storing persistent data. This replaces BinStorage only for its use within pig. Since BinStorage is used in pig queries and it should be easy to maintain the code, I think we don't have to deprecate BinStorage.

Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch In certain types of pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if the PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of several times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests: 1. Use a smaller number of bytes to store the length of the column.
For example, if a bytearray is smaller than 255 bytes, a single byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half. Zebra and BinStorage are known to use the DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to differ from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
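Optimization 1 above (a one-byte length prefix for short byte arrays) can be sketched as follows (illustrative Python; the actual InterSedes wire format may differ, and the 0xFF escape marker is an assumption of this sketch):

```python
import struct

# Sketch: variable-width length prefix. Arrays shorter than 255 bytes
# get a single length byte; longer arrays get an escape marker (0xFF)
# followed by a 4-byte big-endian length. This saves 3 bytes per short
# column versus always writing a 4-byte integer length.

def encode(data: bytes) -> bytes:
    if len(data) < 255:
        return bytes([len(data)]) + data                      # 1-byte length
    return b'\xff' + struct.pack('>I', len(data)) + data      # marker + int length

def decode(buf: bytes) -> bytes:
    if buf[0] != 0xFF:
        return buf[1:1 + buf[0]]
    (n,) = struct.unpack('>I', buf[1:5])
    return buf[5:5 + n]
```

A 10-byte value serializes to 11 bytes instead of 14; the common short-column case is where sedes-heavy queries spend most of their bytes, which is the motivation given in the issue description.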
[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1472: --- Attachment: PIG-1472.4.patch Removed unused static constants from InterStorage and BinStorage, addressing comment #1 from Daniel. Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, PIG-1472.patch In certain types of pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if the PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of several times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests: 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a single byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half. Zebra and BinStorage are known to use the DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to differ from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1436) Print number of records outputted at each step of a Pig script
[ https://issues.apache.org/jira/browse/PIG-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887446#action_12887446 ] Richard Ding commented on PIG-1436: --- Russell, PIG-1478 implemented a callback mechanism that allows users to retrieve stats after each job. Will this meet your needs? Print number of records outputted at each step of a Pig script -- Key: PIG-1436 URL: https://issues.apache.org/jira/browse/PIG-1436 Project: Pig Issue Type: New Feature Components: grunt Affects Versions: 0.7.0 Reporter: Russell Jurney Assignee: Richard Ding Priority: Minor Fix For: 0.8.0 I often run a script multiple times, or have to go and look through Hadoop task logs, to figure out where I broke a long script in such a way that I get 0 records out of it. I think this is a common problem. If someone can point me in the right direction, I can make a pass at this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-884) Have a way to export RulePlan and other kinds of OperatorPlan to common representation (dot?) and import from dot to RulePlan
[ https://issues.apache.org/jira/browse/PIG-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-884. Resolution: Fixed Dot notation for explain was added as part of the Pig 0.3.0 work. Have a way to export RulePlan and other kinds of OperatorPlan to common representation (dot?) and import from dot to RulePlan - Key: PIG-884 URL: https://issues.apache.org/jira/browse/PIG-884 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Pradeep Kamath Have a way to export RulePlan and other kinds of OperatorPlan to common representation (dot?) and import from dot to RulePlan -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-886) clone should be updated in LogicalOperators to include cloning of projection map information and any other information used by LogicalOptimizer
[ https://issues.apache.org/jira/browse/PIG-886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-886. Resolution: Fixed This is no longer relevant with the optimizer re-work. clone should be updated in LogicalOperators to include cloning of projection map information and any other information used by LogicalOptimizer --- Key: PIG-886 URL: https://issues.apache.org/jira/browse/PIG-886 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Pradeep Kamath clone should be updated in LogicalOperators to include cloning of projection map information and any other information used by LogicalOptimizer -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
[ https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-900: --- Fix Version/s: 0.9.0 ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY - Key: PIG-900 URL: https://issues.apache.org/jira/browse/PIG-900 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz Fix For: 0.9.0 With GROUP BY, you must put parentheses around the aliases in the BY clause: {code} B = group A by ( a, b, c ); {code} With FILTER BY, you can optionally put parentheses around the aliases in the BY clause: {code} B = filter A by ( a is not null and b is not null and c is not null ); {code} However, with ORDER BY, if you put parentheses around the BY clause, you get a syntax error: {code} A = order A by ( a, b, c ); {code} Produces the error: {code} 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered ",", at line 3, column 19. Was expecting: ")" ... {code} This is an annoyance really. Here's my full code example ... {code} A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: chararray ); A = order A by ( a, b, c ); dump A; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-902) Allow schema matching for UDF with variable length arguments
[ https://issues.apache.org/jira/browse/PIG-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-902: --- Fix Version/s: 0.9.0 Allow schema matching for UDF with variable length arguments Key: PIG-902 URL: https://issues.apache.org/jira/browse/PIG-902 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Fix For: 0.9.0 Pig picks the right version of a UDF using a similarity measurement. This mechanism picks the UDF with the right input schema to use. However, some UDFs take a variable number of inputs, and currently there is no way to declare such an input schema in a UDF; the similarity measurement does not match against a variable number of inputs. We can still write variable-input UDFs, but we cannot rely on schema matching to pick the right UDF version and do the automatic data type conversion. E.g.: If we have: Integer udf1(Integer, ..); Integer udf1(String, ..); Currently we cannot do this: a: {chararray, chararray} b = foreach a generate udf1(a.$0, a.$1); // Pig cannot pick the udf1(String, ..) automatically; currently, this statement fails E.g.: If we have: Integer udf2(Integer, ..); Currently, this script fails: a: {chararray, chararray} b = foreach a generate udf2(a.$0, a.$1); // Currently, Pig cannot convert a.$0 into Integer automatically -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-903) ILLUSTRATE fails on 'Distinct' operator
[ https://issues.apache.org/jira/browse/PIG-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-903: --- Fix Version/s: 0.9.0 ILLUSTRATE fails on 'Distinct' operator --- Key: PIG-903 URL: https://issues.apache.org/jira/browse/PIG-903 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.9.0 Using the latest Pig from trunk (0.3+) in mapreduce mode, running through the tutorial script script1-hadoop.pig works fine. However, executing the following illustrate command throws an exception: illustrate ngramed2 Pig Stack Trace --- ERROR 2999: Unexpected internal error. Unrecognized logical operator. java.lang.RuntimeException: Unrecognized logical operator. at org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(EquivalenceClasses.java:60) at org.apache.pig.pen.DerivedDataVisitor.evaluateOperator(DerivedDataVisitor.java:368) at org.apache.pig.pen.DerivedDataVisitor.visit(DerivedDataVisitor.java:226) at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:104) at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:37) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:98) at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:90) at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:106) at org.apache.pig.PigServer.getExamples(PigServer.java:724) at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:541) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:195) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at 
org.apache.pig.Main.main(Main.java:361) This works: illustrate ngramed1; Although it does throw a few NPEs : java.lang.NullPointerException at org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205) at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190) at org.apache.pig.pen.util.DisplayExamples.PrintTabular(DisplayExamples.java:86) [...] (illustrate also doesn't work on bzipped input, but that's a separate issue) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-898) TextDataParser does not handle delimiters from one complex type in another
[ https://issues.apache.org/jira/browse/PIG-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-898. Fix Version/s: 0.7.0 Resolution: Fixed This has been addressed as part of 613 TextDataParser does not handle delimiters from one complex type in another -- Key: PIG-898 URL: https://issues.apache.org/jira/browse/PIG-898 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Santhosh Srinivasan Priority: Minor Fix For: 0.7.0 Currently, TextDataParser does not handle delimiters of one complex type in another. An example of such a case is key1(#value1}, which will not be parsed correctly. The production for strings matches any sequence of characters that does not contain any delimiters for the complex types. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: PIG Logical Optimization: Use CNF in SplitFilter
I was wondering, if you are not going to check in your patch soon, whether it would be possible for you to share it with me. I believe I might be able to reuse some of your (utility) functionality directly or get some ideas. About your cost-benefit question: 1) I will control the complexity of CNF conversion by providing a configurable threshold value which will limit the OR-nesting. 2) One benefit of this conversion is that it will allow pushing parts of a filter (conjuncts) across joins, which does not happen in the current PushUpFilter optimization. Moreover, it may have a cascading effect, pushing the conjuncts below other operators via other rules that may fire as a result. The benefit from this is really data dependent, but in big-data workloads, any kind of predicate pushdown may eventually lead to big savings in the amount of data read or the amount of data transferred/shuffled across the network (I need to understand the LogicalPlan to PhysicalPlan conversion better to give concrete examples). Thanks! Swati On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote: Yes, I already implemented the “NOT push down” upfront, so you do not need to do that. The support of CNF will probably be the most difficult part. But as I mentioned last time, you should compare the cost after trimming the CNF to get the post-split filtering logic. Given the complexity of manipulating CNF and the undetermined benefits, I am not sure whether it should be in scope at this moment. To handle CNF, I think it’s a good idea to create a new plan and connect the nodes in the new plan to the base plan as you envisioned. In my changes, which use DNF instead of CNF but whose processing is otherwise similar, I use a LogicalExpressionProxy, which contains a “source” member that is just the node in the original plan, to link the nodes in the new plan and the old plan. 
The original LogicalExpression is enhanced with a counter to trace the # of proxies of the original nodes, since normal form creation will “spread” the nodes in the original tree across many normalized nodes. The benefit, aside from not setting the plan, is that the original expression is trimmed according to the processing results from the DNF, while the DNF is created separately, as a kind of utility, so that complex features can be used. In my changes, I used a multiple-child tree in the DNF while not changing the original binary expression tree structure. Another benefit is that the original tree is kept as much as it is at the start, i.e., I do not attempt to optimize its overall structure beyond trimming based upon the simplification logic. (I also control the size of the DNF to 100 nodes.) The downside of this is added complexity. But in your case, for scenario 2, which is the whole point of using CNF, you would need to change the original expression tree structurally beyond trimming for the post-split filtering logic. The other benefit of using a multiple-child expression depends on whether you plan to support such expressions to replace the current binary tree in the final plan. Even though I think it’s a good idea to support that, it is not in my scope now. I’ll add my algorithm details to my jira soon. Please take a look and comment as you see appropriate. Thanks, Yan -- *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu] *Sent:* Friday, July 09, 2010 11:00 PM *To:* Yan Zhou *Cc:* pig-dev@hadoop.apache.org *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter Hi Yan, I agree that the first scenario (filter logic applied to individual input sources) doesn't need conversion to CNF and that it will be a good idea to add CNF functionality for the second scenario. I was also planning to provide a configurable threshold value to control the complexity of CNF conversion. 
As part of the above, I wrote a utility to push the NOT operator in predicates below the AND and OR operators (Scenario 2 in PIG-1399). I am considering making this utility to push NOT a separate rule in itself. Lmk if you have already implemented this. While implementing this utility I am facing some trouble keeping the OperatorPlan consistent as I rewrite the expression. This is because each operator is referencing the main filter logical plan. Here is my current approach of implementation: 1. I am creating a new LogicalExpressionPlan for the converted boolean expression. 2. I am creating new logical expressions while pushing the NOT operation, converting AND into OR, OR into AND, and eliminating NOT NOT pairs. 3. However, I am having trouble updating the LogicalExpressionPlan if it reaches the base case (i.e. the root operator is not NOT, AND, or OR). D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 > 10) AND (c3+b3 > 10) ) ) ); In the above, for example, I am not sure how to integrate the base expression (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the plan for a given
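The NOT push-down discussed in this thread (step 2 above) can be sketched on a plain expression tree. This is a toy illustration, not Pig's LogicalExpressionPlan API: an expression is a leaf string or a tuple of an operator name and children.

```python
# Push NOT below AND/OR using De Morgan's laws, and cancel NOT NOT pairs.
# An expression is a leaf string, or a tuple ("not"|"and"|"or", *children).

def push_not(expr, negate=False):
    """Return an equivalent tree with NOT pushed down to the leaves."""
    if isinstance(expr, str):                      # base case: a leaf predicate
        return ("not", expr) if negate else expr
    op = expr[0]
    if op == "not":
        # NOT NOT cancels; a single NOT just flips the negate flag.
        return push_not(expr[1], not negate)
    # De Morgan: NOT(a AND b) == NOT a OR NOT b, and vice versa.
    if negate:
        op = "or" if op == "and" else "and"
    return (op,) + tuple(push_not(c, negate) for c in expr[1:])

# NOT( (c1 > 10) AND (c3 + b3 > 10) ) becomes NOT(c1 > 10) OR NOT(c3 + b3 > 10)
tree = ("not", ("and", "c1 > 10", "c3 + b3 > 10"))
print(push_not(tree))
# ('or', ('not', 'c1 > 10'), ('not', 'c3 + b3 > 10'))
```

The base-case trouble described above corresponds to the leaf branch: a leaf like `c2 == 5` is returned untouched (or wrapped in a single NOT) rather than rebuilt.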
[jira] Commented: (PIG-914) Change the PIG hbase interface to use bytes along with strings
[ https://issues.apache.org/jira/browse/PIG-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887479#action_12887479 ] Olga Natkovich commented on PIG-914: Alex, are you still planning to work on this? Change the PIG hbase interface to use bytes along with strings -- Key: PIG-914 URL: https://issues.apache.org/jira/browse/PIG-914 Project: Pig Issue Type: Improvement Reporter: Alex Newman Priority: Minor Currently start rows, table names, and column names are all strings; since HBase supports bytes, we might want to change the Pig interface to support bytes along with strings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-916) Change the pig hbase interface to get more than one row at a time when scanning
[ https://issues.apache.org/jira/browse/PIG-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887480#action_12887480 ] Olga Natkovich commented on PIG-916: Alex, are you still planning to work on this? Change the pig hbase interface to get more than one row at a time when scanning --- Key: PIG-916 URL: https://issues.apache.org/jira/browse/PIG-916 Project: Pig Issue Type: Improvement Reporter: Alex Newman Priority: Trivial It should be significantly faster to get numerous rows at the same time rather than one row at a time for large table extraction processes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig
[ https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887476#action_12887476 ] Olga Natkovich commented on PIG-909: Did this actually get checked in? Should this be resurrected for Pig 0.8.0 or closed? Allow Pig executable to use hadoop jars not bundled with pig Key: PIG-909 URL: https://issues.apache.org/jira/browse/PIG-909 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Attachments: pig_909.patch The current pig executable (bin/pig) looks for a file named hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig. The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop jars, if that variable is set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-932) Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels
[ https://issues.apache.org/jira/browse/PIG-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-932: --- Fix Version/s: 0.8.0 Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels - Key: PIG-932 URL: https://issues.apache.org/jira/browse/PIG-932 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 To leverage the performance features provided by Zebra, Pig should be able to figure out which input fields are actually used in a Pig script, and prune unnecessary inputs. This feature is being implemented in [PIG-922|https://issues.apache.org/jira/browse/PIG-922]. However, there are two limitations currently: 1. Pruning nested fields only applies to maps. We do not prune sub-fields inside a bag or tuple 2. For maps, currently we only go one level deep. Eg, if in a Pig script the user uses a#'key0'#'key1', a#'key0' will be requested These two limitations are in line with the current limitations of the Zebra loader. Once the Zebra loader can handle this, we need to work to lift these limitations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-931) Samples Syntax Error in Pig UDF Manual
[ https://issues.apache.org/jira/browse/PIG-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-931: --- Assignee: Corinne Chandel Fix Version/s: 0.8.0 Samples Syntax Error in Pig UDF Manual -- Key: PIG-931 URL: https://issues.apache.org/jira/browse/PIG-931 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.2.0, 0.3.0 Environment: Windows XP, firefox 3.5.2 Reporter: Yiwei Chen Assignee: Corinne Chandel Priority: Trivial Fix For: 0.8.0 All samples with 'extends EvalFunc' have syntax errors in http://hadoop.apache.org/pig/docs/r0.3.0/udf.html . There shouldn't be parentheses; they should be angle brackets. For example, in the How to Write a Simple Eval Function section: public class UPPER extends EvalFunc (String) should be public class UPPER extends EvalFunc<String> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-930) merge join should handle compressed bz2 sorted files
[ https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-930: --- Fix Version/s: 0.8.0 Likely, this is no longer an issue in 0.7.0. Need to verify and add unit tests merge join should handle compressed bz2 sorted files Key: PIG-930 URL: https://issues.apache.org/jira/browse/PIG-930 Project: Pig Issue Type: Bug Reporter: Pradeep Kamath Fix For: 0.8.0 There are two issues - POLoad which is used to read the right side input does not handle bz2 files right now. This needs to be fixed. Further, in the index map job we bindTo(startOfBlockOffSet) (this will internally discard the first tuple if offset > 0). Then we do the following: {noformat} While(tuple survives pipeline) { Pos = getPosition() getNext() run the tuple through pipeline in the right side which could have filter } Emit(key, pos, filename). {noformat} Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) (we do pos - 1 because bindTo will discard the first tuple for pos > 0). Then we do getNext() Now in bz2 compressed files, getPosition() returns a position which is not really accurate. The problem is it could be a position in the middle of a compressed bz2 block. Then when we use that position to bindTo() in the final map job, the code would first hunt for a bz2 block header, thus skipping the whole current bz2 block. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
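The failure mode described above can be modeled abstractly (this is a hypothetical sketch, not the real bz2 reader; `BLOCK_SIZE`, `blocks`, and `records_from` are invented): seeking to a byte offset inside a compressed block makes the reader hunt forward for the next block header, silently dropping the rest of the current block.

```python
# Model a block-compressed file as a list of blocks of records, where each
# block starts at a multiple of BLOCK_SIZE. Seeking to a mid-block offset
# resynchronizes at the *next* block header, bz2-style.

BLOCK_SIZE = 100
blocks = [["r0", "r1", "r2"], ["r3", "r4", "r5"]]   # two compressed blocks

def records_from(pos):
    """Read records starting at byte offset pos."""
    # Round up to the next block header unless pos is exactly a block start.
    block = pos // BLOCK_SIZE if pos % BLOCK_SIZE == 0 else pos // BLOCK_SIZE + 1
    return [r for b in blocks[block:] for r in b]

assert records_from(0) == ["r0", "r1", "r2", "r3", "r4", "r5"]
# A mid-block offset (like one returned by getPosition() on bz2 input)
# loses r1 and r2 from the first block:
assert records_from(30) == ["r3", "r4", "r5"]
```

This is why positions recorded by the index job must be block-aligned (or the reader must be able to decompress from mid-block) before merge join can work on bz2 input.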
[jira] Updated: (PIG-932) Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels
[ https://issues.apache.org/jira/browse/PIG-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-932: --- Assignee: Daniel Dai Possible work for 0.8.0. Need to see if we have time Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels - Key: PIG-932 URL: https://issues.apache.org/jira/browse/PIG-932 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 To leverage the performance features provided by Zebra, Pig should be able to figure out which input fields are actually used in a Pig script, and prune unnecessary inputs. This feature is being implemented in [PIG-922|https://issues.apache.org/jira/browse/PIG-922]. However, there are two limitations currently: 1. Pruning nested fields only applies to maps. We do not prune sub-fields inside a bag or tuple 2. For maps, currently we only go one level deep. Eg, if in a Pig script the user uses a#'key0'#'key1', a#'key0' will be requested These two limitations are in line with the current limitations of the Zebra loader. Once the Zebra loader can handle this, we need to work to lift these limitations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-947) Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.
[ https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-947: --- Fix Version/s: 0.8.0 Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple. Key: PIG-947 URL: https://issues.apache.org/jira/browse/PIG-947 Project: Pig Issue Type: Bug Components: data Environment: Pig on Hadoop 18 Reporter: Gandul Azul Fix For: 0.8.0 The PigStorage parser for bags is not working correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly: {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)} while this is not: (Note the space before the second tuple) {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)} It seems that the parser, when it encounters the space, treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which results in an exception. |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException Encountered STRING at |line 1, column 43. |Was expecting: |( ... | field discarded Below is the parser debug output for the parsing of the above error sequence: 2.071200,0), ( from above... ** FOUND A DOUBLENUMBER MATCH (2.071200) ** Call: AtomDatum Consumed token: DOUBLENUMBER: 2.071200 at line 1 column 31 Return: AtomDatum Return: Datum Matched the empty string as STRING token. Current character : , (44) at line 1 column 39 No more string literal token matches are possible. Currently matched the first 1 characters as a , token. ** FOUND A , MATCH (,) ** Consumed token: , at line 1 column 39 Call: Datum Matched the empty string as STRING token. Current character : 0 (48) at line 1 column 40 No string literal matches possible. 
Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER } Current character : 0 (48) at line 1 column 40 Currently matched the first 1 characters as a SIGNEDINTEGER token. Possible kinds of longer matches : { STRING, SIGNEDINTEGER, DOUBLENUMBER, LONGINTEGER, FLOATNUMBER } Current character : ) (41) at line 1 column 41 Currently matched the first 1 characters as a SIGNEDINTEGER token. Putting back 1 characters into the input stream. ** FOUND A SIGNEDINTEGER MATCH (0) ** Call: AtomDatum Consumed token: SIGNEDINTEGER: 0 at line 1 column 40 Return: AtomDatum Return: Datum Matched the empty string as STRING token. Current character : ) (41) at line 1 column 41 No more string literal token matches are possible. Currently matched the first 1 characters as a ) token. ** FOUND A ) MATCH ()) ** Return: Tuple Consumed token: ) at line 1 column 41 Matched the empty string as STRING token. Current character : , (44) at line 1 column 42 No more string literal token matches are possible. Currently matched the first 1 characters as a , token. ** FOUND A , MATCH (,) ** Consumed token: , at line 1 column 42 Matched the empty string as STRING token. Current character : (32) at line 1 column 43 No string literal matches possible. Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER } Current character : (32) at line 1 column 43 Currently matched the first 1 characters as a STRING token. Possible kinds of longer matches : { STRING, SIGNEDINTEGER, DOUBLENUMBER } Current character : ( (40) at line 1 column 44 Currently matched the first 1 characters as a STRING token. Putting back 1 characters into the input stream. ** FOUND A STRING MATCH ( ) ** Return: Bag Return: Datum Return: Parse -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
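The fix the trace points at can be sketched with a minimal hand-rolled bag parser (hypothetical; this is not PigStorage's generated JavaCC parser): skip whitespace before each tuple instead of letting the STRING production swallow " (" as the start of an atom.

```python
# A minimal, whitespace-tolerant parser for flat bags like "{(1,2), (3,4)}".
# Tuples contain only simple atoms here, to keep the sketch short.

def parse_bag(text):
    """Parse '{(a,b),(c)}' into a list of tuples, tolerating spaces."""
    assert text[0] == "{" and text[-1] == "}", "not a bag literal"
    body, tuples, i = text[1:-1], [], 0
    while i < len(body):
        while i < len(body) and body[i] in " ,":   # skip separators AND spaces
            i += 1
        if i >= len(body):
            break
        assert body[i] == "(", "expected tuple start"
        j = body.index(")", i)
        tuples.append(tuple(body[i + 1:j].split(",")))
        i = j + 1
    return tuples

# The space before the second tuple no longer derails the parse:
assert parse_bag("{(1,2), (3,4)}") == [("1", "2"), ("3", "4")]
assert parse_bag("{(1,2), (3,4)}") == parse_bag("{(1,2),(3,4)}")
```

In grammar terms, the equivalent change is making whitespace a skipped token between tuples rather than a candidate first character of a STRING atom.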
[jira] Updated: (PIG-969) Default constructor of UDF gets called for UDF with parameterised constructor , if the udf has a getArgToFuncMapping function defined
[ https://issues.apache.org/jira/browse/PIG-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-969: --- Fix Version/s: 0.9.0 Description: This issue is discussed in http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00524.html . I am able to reproduce the issue. While it is easy to fix the udf, it can take a lot of time to figure out the problem (until they find this email conversation!). The root cause is that when getArgToFuncMapping is defined in the udf, the FuncSpec returned by the method replaces the one set by the define statement. The constructor arguments get lost. We can handle this in the following ways - 1. Preserve the constructor arguments, and use them with the class name of the matching FuncSpec from getArgToFuncMapping. 2. Give an error if constructor parameters are given for a udf which has FuncSpecs returned from getArgToFuncMapping. The problem with approach 1 is that we are letting the user define the FuncSpec, so the user could have defined a FuncSpec with a constructor (though they don't have a valid reason to do so). It is also possible that the constructor of the different class that matched might not support the same constructor parameters. The use of this function outside builtin udfs is also probably not common. With option 2, we are telling the user that this is not a supported use case, and the user can easily change the udf to fix the issue, or use the udf which would have matched the given parameters (which is unlikely to have the getArgToFuncMapping method defined). I am proposing that we go with option 2. was: This issue is discussed in http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00524.html . I am able to reproduce the issue. While it is easy to fix the udf, it can take a lot of time to figure out the problem (until they find this email conversation!). The root cause is that when getArgToFuncMapping is defined in the udf, the FuncSpec returned by the method replaces the one set by the define statement. The constructor arguments get lost. We can handle this in the following ways - 1. Preserve the constructor arguments, and use them with the class name of the matching FuncSpec from getArgToFuncMapping. 2. Give an error if constructor parameters are given for a udf which has FuncSpecs returned from getArgToFuncMapping. The problem with approach 1 is that we are letting the user define the FuncSpec, so the user could have defined a FuncSpec with a constructor (though they don't have a valid reason to do so). It is also possible that the constructor of the different class that matched might not support the same constructor parameters. The use of this function outside builtin udfs is also probably not common. With option 2, we are telling the user that this is not a supported use case, and the user can easily change the udf to fix the issue, or use the udf which would have matched the given parameters (which is unlikely to have the getArgToFuncMapping method defined). I am proposing that we go with option 2. Default constructor of UDF gets called for UDF with parameterised constructor , if the udf has a getArgToFuncMapping function defined - Key: PIG-969 URL: https://issues.apache.org/jira/browse/PIG-969 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Fix For: 0.9.0 This issue is discussed in http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00524.html . I am able to reproduce the issue. While it is easy to fix the udf, it can take a lot of time to figure out the problem (until they find this email conversation!). The root cause is that when getArgToFuncMapping is defined in the udf, the FuncSpec returned by the method replaces the one set by the define statement. The constructor arguments get lost. We can handle this in the following ways - 1. Preserve the constructor arguments, and use them with the class name of the matching FuncSpec from getArgToFuncMapping. 2. Give an error if constructor parameters are given for a udf which has FuncSpecs returned from getArgToFuncMapping. The problem with approach 1 is that we are letting the user define the FuncSpec, so the user could have defined a FuncSpec with a constructor (though they don't have a valid reason to do so). It is also possible that the constructor of the different class that matched might not support the same constructor parameters. The use of this function outside builtin udfs is also probably not common. With option 2, we are telling the user that this is not a supported use case, and the user can easily change the udf to fix the issue, or use the udf which would have matched the given parameters (which is unlikely to have the getArgToFuncMapping method defined). I am proposing that we go with option 2.
[jira] Resolved: (PIG-1182) Pig reference manual does not mention syntax for comments
[ https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1182. - Resolution: Fixed Closing. If we do want to do a comprehensive index, please create a separate JIRA Pig reference manual does not mention syntax for comments - Key: PIG-1182 URL: https://issues.apache.org/jira/browse/PIG-1182 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.5.0 Reporter: David Ciemiewicz The Pig 0.5.0 reference manual does not mention how to write comments in your pig code using -- (two dashes). http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html Also, does /* */ also work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-999) sorting on map-value fails if map-value is not of bytearray type
[ https://issues.apache.org/jira/browse/PIG-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-999: --- Fix Version/s: 0.9.0 sorting on map-value fails if map-value is not of bytearray type Key: PIG-999 URL: https://issues.apache.org/jira/browse/PIG-999 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Fix For: 0.9.0 When query execution plan is created by pig, it assumes the type to be bytearray because there is no schema information associated with map fields. But at run time, the loader might return the actual type. This results in a ClassCastException. This issue points to the larger issue of the way pig is handling types for map-value. This issue should be fixed in the context of revisiting the frontend logic and pig-latin semantics. This is related to PIG-880 . The patch in PIG-880 changed PigStorage to always return bytearray for map values to work around this, but other loaders like BinStorage can return the actual type causing this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-998) revisit frontend logic and pig-latin semantics
[ https://issues.apache.org/jira/browse/PIG-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-998: --- Fix Version/s: 0.9.0 revisit frontend logic and pig-latin semantics -- Key: PIG-998 URL: https://issues.apache.org/jira/browse/PIG-998 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Fix For: 0.9.0 This jira has been created to keep track of issues with current frontend logic and pig-latin semantics. One example is handling of type information of map-values. At time of query plan generation pig does not know the type for map-values and assumes it is bytearray. This leads to problems when the loader returns map-value of other types. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-967) Proposal for adding a metadata interface to Pig
[ https://issues.apache.org/jira/browse/PIG-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-967. Resolution: Won't Fix This is an obsolete proposal Proposal for adding a metadata interface to Pig --- Key: PIG-967 URL: https://issues.apache.org/jira/browse/PIG-967 Project: Pig Issue Type: Improvement Components: impl Reporter: Alan Gates Assignee: Alan Gates Pig needs to have an interface to connect to metadata systems. http://wiki.apache.org/pig/MetadataInterfaceProposal proposes an interface for this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's
[ https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1065: Fix Version/s: 0.9.0 In-determinate behaviour of Union when there are 2 non-matching schema's Key: PIG-1065 URL: https://issues.apache.org/jira/browse/PIG-1065 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.9.0 I have a script which first does a union of two relations with non-matching schemas and then does an ORDER BY of the result. {code} f1 = LOAD '1.txt' as (key:chararray, v:chararray); f2 = LOAD '2.txt' as (key:chararray); u0 = UNION f1, f2; describe u0; dump u0; u1 = ORDER u0 BY $0; dump u1; {code} When I run in Map Reduce mode I get the following result: $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig Schema for u0 unknown. (1,2) (2,3) (1) (2) org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias u1 at org.apache.pig.PigServer.openIterator(PigServer.java:475) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) When I run the same script in local mode I get a different result, as we know that local mode does not use any Hadoop Classes. $java -cp pig.jar org.apache.pig.Main -x local broken.pig Schema for u0 unknown (1,2) (1) (2,3) (2) (1,2) (1) (2,3) (2) Here are some questions 1) Why do we allow union if the schemas do not match 2) Should we not print an error message/warning so that the user knows that this is not allowed or he can get unexpected results? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
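The "Type mismatch in key from map" failure above can be modeled with a hypothetical sketch (the `Collector` class here is invented; it only mirrors the shape of MapTask's key-class check, not Hadoop's API): the sort declares one key class up front, but the two union branches emit keys wrapped in different types.

```python
# Toy model of a map-output collector that, like Hadoop's MapOutputBuffer,
# enforces a single declared key class for the whole sort.

class Collector:
    def __init__(self, key_class):
        self.key_class, self.keys = key_class, []

    def collect(self, key):
        if type(key) is not self.key_class:      # mirrors MapTask's check
            raise TypeError("Type mismatch in key from map: expected %s, "
                            "received %s" % (self.key_class.__name__,
                                             type(key).__name__))
        self.keys.append(key)

out = Collector(bytes)       # schema unknown -> keys declared as bytearray
out.collect(b"1")            # one union branch emits a bytearray key: fine
try:
    out.collect("1")         # the other branch emits a text key: fails
except TypeError as e:
    print(e)
```

This also suggests why local mode "works": without a partitioned sort there is no single declared key class to violate, so the mismatch passes silently and just yields the interleaved output shown above.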
[jira] Updated: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null
[ https://issues.apache.org/jira/browse/PIG-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1066: Fix Version/s: 0.9.0 ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null Key: PIG-1066 URL: https://issues.apache.org/jira/browse/PIG-1066 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.4.0 Reporter: Bogdan Dorohonceanu Fix For: 0.9.0 -- load the QID_CT_QP20 data x = LOAD '$FS_TD/$QID_IN_FILES' USING PigStorage('\t') AS (unstem_qid:chararray, jid_score_pairs:chararray); DESCRIBE x; --ILLUSTRATE x; -- load the ID_RQ data y0 = LOAD '$FS_USER/$ID_RQ_IN_FILE' USING PigStorage('\t') AS (sid:chararray, query:chararray); -- force parallelization -- y1 = ORDER y0 BY sid PARALLEL $NUM; -- compute unstem_qid DEFINE f `text_streamer_query j3_unicode.dat prop.dat normal.txt TAB TAB 1:yes:UNSTEM_ID:%llx` INPUT(stdin USING PigStorage('\t')) OU\ TPUT(stdout USING PigStorage('\t')) SHIP('$USER/text_streamer_query', '$USER/j3_unicode.dat', '$USER/prop.dat', '$USER/normal.txt'); y = STREAM y0 THROUGH f AS (sid:chararray, query:chararray, unstem_qid:chararray); DESCRIBE y; --ILLUSTRATE y; rmf /user/vega/zoom/y_debug STORE y INTO '/user/vega/zoom/y_debug' USING PigStorage('\t'); 2009-10-30 13:36:48,437 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name 2009-10-30 13:36:48,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: dd-9c32d04:8889 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: dd-9c32d04:8889 2009-10-30 13:36:49,242 [main] ERROR 
org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null 09/10/30 13:36:49 ERROR grunt.Grunt: ERROR 2999: Unexpected internal error. null Details at logfile: /disk1/vega/zoom/pig_1256909801304.log -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1056) table can not be loaded after store
[ https://issues.apache.org/jira/browse/PIG-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1056. - Resolution: Invalid The script is invalid and that's why you see the error table can not be loaded after store --- Key: PIG-1056 URL: https://issues.apache.org/jira/browse/PIG-1056 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Pig Stack Trace --- ERROR 1018: Problem determining schema during load org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Problem determining schema during load at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1023) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967) at org.apache.pig.PigServer.registerQuery(PigServer.java:383) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem determining schema during load at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017) ... 8 more Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: Problem determining schema during load at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732) ... 
10 more Caused by: java.io.IOException: No table specified for input at org.apache.hadoop.zebra.pig.TableLoader.checkConf(TableLoader.java:238) at org.apache.hadoop.zebra.pig.TableLoader.determineSchema(TableLoader.java:258) at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148) ... 11 more ~ script: register /grid/0/dev/hadoopqa/hadoop/lib/zebra.jar; A = load 'filter.txt' as (name:chararray, age:int); B = filter A by age > 20; --dump B; store B into 'filter1' using org.apache.hadoop.zebra.pig.TableStorer('[name];[age]'); rec1 = load 'B' using org.apache.hadoop.zebra.pig.TableLoader(); dump rec1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1092) Pig Latin Parser fails to recognize \n as a whitespace
[ https://issues.apache.org/jira/browse/PIG-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1092: Fix Version/s: 0.9.0 Pig Latin Parser fails to recognize \n as a whitespace Key: PIG-1092 URL: https://issues.apache.org/jira/browse/PIG-1092 Project: Pig Issue Type: Bug Components: grunt Environment: RHEL linux Reporter: Yang Yang Priority: Minor Fix For: 0.9.0 the following pig latin script fails to parse a = load 'input_file' as ( field1 : int ); note that there is no char after the as, so there is only one \n char between the as and ( on the next line. adding a whitespace after as solves it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1112) FLATTEN eliminates the alias
[ https://issues.apache.org/jira/browse/PIG-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1112: Fix Version/s: 0.9.0 FLATTEN eliminates the alias Key: PIG-1112 URL: https://issues.apache.org/jira/browse/PIG-1112 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: Daniel Dai Fix For: 0.9.0 If schema for a field of type 'bag' is partially defined then FLATTEN() incorrectly eliminates the field and throws an error. Consider the following example:- A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, ladder:bag{}); B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second; C = GROUP B by (first,third); This throws the error ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: third in {first: chararray,second: chararray} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1017: Fix Version/s: 0.9.0 Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Fix For: 0.9.0 Attachments: stotext.patch Strings in Java are UTF-16 and take 2 bytes per character. Text (org.apache.hadoop.io.Text) stores the data in UTF-8 and could significantly reduce memory use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
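A short Python sketch (illustrative only, not part of the patch) of the memory difference the issue describes: ASCII-heavy data such as log fields costs two bytes per character in UTF-16 but only one in UTF-8.

```python
# Compare the encoded size of an ASCII-heavy string in UTF-16 (Java's
# in-memory String representation) vs. UTF-8 (org.apache.hadoop.io.Text).
def encoded_size(s: str, encoding: str) -> int:
    """Return the number of bytes needed to encode s."""
    return len(s.encode(encoding))

sample = "bcookie=36co9b55onr8s"  # a typical ASCII log field
utf16 = encoded_size(sample, "utf-16-le")  # 2 bytes per ASCII char, no BOM
utf8 = encoded_size(sample, "utf-8")       # 1 byte per ASCII char
assert utf16 == 2 * len(sample)
assert utf8 == len(sample)
```

For mostly-ASCII datasets this is roughly a 2x reduction; non-Latin text narrows or reverses the gap, since UTF-8 uses up to 3 bytes for BMP characters.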
[jira] Updated: (PIG-1152) bincond operator throws parser error
[ https://issues.apache.org/jira/browse/PIG-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1152: Fix Version/s: 0.9.0 bincond operator throws parser error Key: PIG-1152 URL: https://issues.apache.org/jira/browse/PIG-1152 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur Fix For: 0.9.0 Bincond operator throws a parser error when the true condition contains a constant bag with 1 tuple containing a single int field with a negative value. Here is the script to reproduce the issue: A = load 'A' as (s: chararray, x: int, y: int); B = group A by s; C = foreach B generate group, flatten(((COUNT(A) > 1L) ? {(-1)} : A.x)); dump C; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1178: Fix Version/s: 0.8.0 LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1235) OptimizerException: Problem while rebuilding projection map or schema in logical optimizer
[ https://issues.apache.org/jira/browse/PIG-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1235. - Resolution: Won't Fix This is not relevant with the new optimizer OptimizerException: Problem while rebuilding projection map or schema in logical optimizer -- Key: PIG-1235 URL: https://issues.apache.org/jira/browse/PIG-1235 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Richard Ding Here is the script that throws this exception: {code} A = load '1.txt' as (x, y, z); B = group A by (x > 0 ? x : 0); C = filter B by group > 10; explain C {code} Pig Stack Trace --- ERROR 2157: Error while fixing projections. No mapping available in old predecessor to replace column. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias C at org.apache.pig.PigServer.explain(PigServer.java:593) at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:315) at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:268) at org.apache.pig.tools.pigscript.parser.PigScriptParser.Explain(PigScriptParser.java:517) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:265) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:352) Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2145: Problem while rebuilding projection map or schema in logical optimizer. at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:215) at org.apache.pig.PigServer.compileLp(PigServer.java:856) at org.apache.pig.PigServer.compileLp(PigServer.java:792) at org.apache.pig.PigServer.getStorePlan(PigServer.java:734) at org.apache.pig.PigServer.explain(PigServer.java:576) ... 
8 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error
[ https://issues.apache.org/jira/browse/PIG-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1247: Fix Version/s: 0.9.0 Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error - Key: PIG-1247 URL: https://issues.apache.org/jira/browse/PIG-1247 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.9.0 I have a large script in which there are intermediate stores statements, one of them writes to a directory I do not have permission to write to. The stack trace I get from Pig is this: 2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error Details at logfile: /home/viraj/pig_1266632145355.log Pig Stack Trace --- ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error java.lang.ClassCastException: org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986) at org.apache.pig.PigServer.registerQuery(PigServer.java:386) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:386) The only way to find the error was to look at the javacc generated QueryParser.java code and do a System.out.println() Here is a script to reproduce the problem: {code} A = load '/user/viraj/three.txt' using PigStorage(); B = foreach A generate ['a'#'12'] as b:map[] ; store B into '/user/secure/pigtest' using PigStorage(); {code} three.txt has 3 lines which contain nothing but the number 1. {code} $ hadoop fs -ls /user/secure/ ls: could not get get listing for 'hdfs://mynamenode/user/secure' : org.apache.hadoop.security.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode=secure:secure:users:rwx-- {code} Viraj -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1277) Pig should give error message when cogroup on tuple keys of different inner type
[ https://issues.apache.org/jira/browse/PIG-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1277: Fix Version/s: 0.9.0 Pig should give error message when cogroup on tuple keys of different inner type Key: PIG-1277 URL: https://issues.apache.org/jira/browse/PIG-1277 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Fix For: 0.9.0 When we cogroup on a tuple, if the inner types of the tuple do not match, we treat them as different keys. This is confusing. It is desirable to give errors/warnings when it happens. Here is one example: UDF: {code} public class MapGenerate extends EvalFunc<Map> { @Override public Map exec(Tuple input) throws IOException { Map m = new HashMap(); m.put("key", new Integer(input.size())); return m; } @Override public Schema outputSchema(Schema input) { return new Schema(new Schema.FieldSchema(null, DataType.MAP)); } } {code} Pig script: {code} a = load '1.txt' as (a0); b = foreach a generate a0, MapGenerate(*) as m:map[]; c = foreach b generate a0, m#'key' as key; d = load '2.txt' as (c0, c1); e = cogroup c by (a0, key), d by (c0, c1); dump e; {code} 1.txt {code} 1 {code} 2.txt {code} 1 1 {code} User expected result (which is not right): {code} ((1,1),{(1,1)},{(1,1)}) {code} Real result: {code} ((1,1),{(1,1)},{}) ((1,1),{},{(1,1)}) {code} We should give the user a message that we cannot merge the keys due to the type mismatch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
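A Python sketch (illustrative only, not Pig internals) of why the two relations land in separate groups: the `key` field coming out of the map is an int in one relation, while the loaded field is a chararray in the other, so the composite keys render identically but compare as distinct.

```python
# Two composite keys that both print as (1,1) but differ in inner type.
left_key = ("1", 1)     # (a0:chararray, m#'key':int)
right_key = ("1", "1")  # (c0:chararray, c1:chararray)
assert left_key != right_key  # same rendering, different types

# Grouping by these keys therefore produces two groups, matching the
# "real result" in the issue rather than the merged group the user expected.
groups = {}
for relation, key in (("c", left_key), ("d", right_key)):
    groups.setdefault(key, []).append(relation)
assert len(groups) == 2
```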
[jira] Updated: (PIG-1319) New logical optimization rules
[ https://issues.apache.org/jira/browse/PIG-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1319: Fix Version/s: 0.8.0 New logical optimization rules -- Key: PIG-1319 URL: https://issues.apache.org/jira/browse/PIG-1319 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 In [PIG-1178|https://issues.apache.org/jira/browse/PIG-1178], we build a new logical optimization framework. One design goal for the new logical optimizer is to make it easier to add new logical optimization rules. In this Jira, we keep track of the development of these new logical optimization rules. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1328) pigtest ant target fails pigtrunk builds
[ https://issues.apache.org/jira/browse/PIG-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1328. - Resolution: Fixed I believe all tests are running now. Please, re-open and clarify if this is still an issue pigtest ant target fails pigtrunk builds Key: PIG-1328 URL: https://issues.apache.org/jira/browse/PIG-1328 Project: Pig Issue Type: Bug Components: build Reporter: Giridharan Kesavan java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo) [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.154 sec [junit] Test org.apache.hadoop.zebra.pig.TestTableSortStorer FAILED [junit] Running org.apache.hadoop.zebra.pig.TestTableSortStorerDesc [junit] log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration). [junit] log4j:WARN Please initialize the log4j system properly. [junit] [CLOVER] FATAL ERROR: Clover could not be initialised. Are you sure you have Clover in the runtime classpath? (class java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo) [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.164 sec [junit] Test org.apache.hadoop.zebra.pig.TestTableSortStorerDesc FAILED -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1188) Padding nulls to the input tuple according to input schema
[ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1188: Fix Version/s: 0.9.0 Padding nulls to the input tuple according to input schema -- Key: PIG-1188 URL: https://issues.apache.org/jira/browse/PIG-1188 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.9.0 Currently, the number of fields in the input tuple is determined by the data. When we have schema, we should generate input data according to the schema, and padding nulls if necessary. Here is one example: Pig script: {code} a = load '1.txt' as (a0, a1); dump a; {code} Input file: {code} 1 2 1 2 3 1 {code} Current result: {code} (1,2) (1,2,3) (1) {code} Desired result: {code} (1,2) (1,2) (1, null) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
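The desired padding behavior can be sketched in Python (illustrative only, not the actual Pig implementation): short rows are padded with nulls to the declared schema length, and extra trailing fields are dropped.

```python
def pad_to_schema(fields, schema_len):
    """Pad a short input tuple with nulls (None) to the declared schema
    length; truncate any fields beyond the schema."""
    fields = list(fields)[:schema_len]
    return tuple(fields + [None] * (schema_len - len(fields)))

# The three input rows from the issue, against schema (a0, a1):
rows = [("1", "2"), ("1", "2", "3"), ("1",)]
assert [pad_to_schema(r, 2) for r in rows] == [
    ("1", "2"), ("1", "2"), ("1", None)]
```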
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1452: Fix Version/s: 0.8.0 to remove hadoop20.jar from lib and use hadoop from the apache maven repo. -- Key: PIG-1452 URL: https://issues.apache.org/jira/browse/PIG-1452 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.8.0 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: PIG-1452.PATCH Pig uses Ivy for dependency management, but it still uses hadoop20.jar from the lib folder. Now that the hadoop-0.20.2 artifacts are available in the Maven repo, Pig should leverage Ivy for resolving/retrieving Hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1387) Syntactical Sugar for PIG-1385
[ https://issues.apache.org/jira/browse/PIG-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1387: Fix Version/s: 0.9.0 Syntactical Sugar for PIG-1385 -- Key: PIG-1387 URL: https://issues.apache.org/jira/browse/PIG-1387 Project: Pig Issue Type: Wish Components: grunt Affects Versions: 0.6.0 Reporter: hc busy Fix For: 0.9.0 From this conversation, extend PIG-1385 to instead of calling UDF use built-in behavior when the (),{},[] groupings are encountered. What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Some times I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belong), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? 
On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDF's in piggybank for these: toBag(); toTuple(); --which is kinda like exec(Tuple in){return in;} TupleToBag(); --some times you need it this way for some reason. Ok. I place my current code here, may be later I make a patch (if such implementation is acceptable of course). import org.apache.pig.EvalFunc; import org.apache.pig.data.BagFactory; import org.apache.pig.data.DataBag; import org.apache.pig.data.Tuple; import org.apache.pig.data.TupleFactory; import java.io.IOException; /** * Convert any sequence of fields to a bag with the specified count of fields. * Schema: count:int, fld1 [, fld2, fld3, fld4... ]. * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... } * * @author astepachev */ public class ToBag extends EvalFunc<DataBag> { public BagFactory bagFactory; public TupleFactory tupleFactory; public ToBag() { bagFactory = BagFactory.getInstance(); tupleFactory = TupleFactory.getInstance(); } @Override public DataBag exec(Tuple input) throws IOException { if (input.isNull()) return null; final DataBag bag = bagFactory.newDefaultBag(); final Integer counter = (Integer) input.get(0); if (counter == null) return null; Tuple tuple = tupleFactory.newTuple(); for (int i = 0; i < input.size() - 1; i++) { if (i % counter == 0) { tuple = tupleFactory.newTuple(); bag.add(tuple); } tuple.append(input.get(i + 1)); } return bag; } } import org.apache.pig.ExecType; import org.apache.pig.PigServer; import org.junit.Before; import org.junit.Test; import java.io.IOException; import java.net.URISyntaxException; import java.net.URL; import static org.junit.Assert.assertTrue; /** * @author astepachev */ public class ToBagTest { PigServer pigServer; URL inputTxt; @Before public void init() throws IOException, URISyntaxException { pigServer = new PigServer(ExecType.LOCAL); inputTxt = 
this.getClass().getResource("bagTest.txt").toURI().toURL(); } @Test public void testSimple() throws IOException { pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() + "' using PigStorage(',') " + "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray);"); pigServer.registerQuery(last =
[jira] Updated: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED
[ https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1341: Fix Version/s: 0.9.0 BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED -- Key: PIG-1341 URL: https://issues.apache.org/jira/browse/PIG-1341 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.9.0 Attachments: PIG-1341.patch Script reads in BinStorage data and tries to convert a column which is in DataByteArray to Chararray. {code} raw = load 'sampledata' using BinStorage() as (col1,col2, col3); --filter out null columns A = filter raw by col1#'bcookie' is not null; B = foreach A generate col1#'bcookie' as reqcolumn; describe B; --B: {regcolumn: bytearray} X = limit B 5; dump X; B = foreach A generate (chararray)col1#'bcookie' as convertedcol; describe B; --B: {convertedcol: chararray} X = limit B 5; dump X; {code} The first dump produces: (36co9b55onr8s) (36co9b55onr8s) (36hilul5oo1q1) (36hilul5oo1q1) (36l4cj15ooa8a) The second dump produces: () () () () () It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s). Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
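By analogy (a Python sketch, not Pig's actual cast code): a bytearray-to-chararray cast should succeed only when the underlying value really is raw bytes; when it is some other type, the field is discarded as null and a conversion-failure warning is counted, which is the behavior the second dump shows.

```python
warnings = {"FIELD_DISCARDED_TYPE_CONVERSION_FAILED": 0}

def cast_to_chararray(value):
    """Hypothetical cast: decode raw bytes to a string; anything else is
    discarded as null and counted, mirroring Pig's aggregated warning."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode("utf-8")
    warnings["FIELD_DISCARDED_TYPE_CONVERSION_FAILED"] += 1
    return None

assert cast_to_chararray(b"36co9b55onr8s") == "36co9b55onr8s"
assert cast_to_chararray(12345) is None  # not bytes: discarded as null
assert warnings["FIELD_DISCARDED_TYPE_CONVERSION_FAILED"] == 1
```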
[jira] Updated: (PIG-1358) [piggybank] String functions should handle exceptions in a consistent manner
[ https://issues.apache.org/jira/browse/PIG-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1358: Fix Version/s: 0.9.0 [piggybank] String functions should handle exceptions in a consistent manner - Key: PIG-1358 URL: https://issues.apache.org/jira/browse/PIG-1358 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Richard Ding Fix For: 0.9.0 The String functions in piggybank handle exceptions differently. Some catch all exceptions, some catch only ClassCastException, while others catch only ExecException. The exception handling code in these functions should be consistent. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1399: Fix Version/s: 0.8.0 Logical Optimizer: Expression optimizor rule Key: PIG-1399 URL: https://issues.apache.org/jira/browse/PIG-1399 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Yan Zhou Fix For: 0.8.0 We can optimize expressions in several ways: 1. Constant pre-calculation Example: B = filter A by a0 > 5+7; => B = filter A by a0 > 12; 2. Boolean expression optimization Example: B = filter A by not (not(a0 > 5) or a1 > 0); => B = filter A by a0 > 5 and a1 <= 0; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
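Both rewrites can be sanity-checked with a small Python sketch (illustrative only; it assumes the comparison operators stripped from the issue text were > and <=, i.e. the predicates are a0 > 5+7 and not(not(a0 > 5) or a1 > 0)):

```python
# Each original/rewritten pair must agree on every input: constant folding
# (5+7 -> 12) and De Morgan-style boolean simplification.
original_1 = lambda a0: a0 > 5 + 7
rewritten_1 = lambda a0: a0 > 12
original_2 = lambda a0, a1: not (not (a0 > 5) or a1 > 0)
rewritten_2 = lambda a0, a1: a0 > 5 and a1 <= 0

for a0 in range(0, 20):
    assert original_1(a0) == rewritten_1(a0)
    for a1 in range(-3, 4):
        assert original_2(a0, a1) == rewritten_2(a0, a1)
```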
[jira] Updated: (PIG-1459) Need a standard way to communicate the requested fields between front and back end for loaders
[ https://issues.apache.org/jira/browse/PIG-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1459: Fix Version/s: 0.9.0 Need a standard way to communicate the requested fields between front and back end for loaders -- Key: PIG-1459 URL: https://issues.apache.org/jira/browse/PIG-1459 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Alan Gates Fix For: 0.9.0 Pig currently provides no mechanism for loader writers to communicate which fields have been requested between the front and back end. Since any loader that accepts pushed projections has to deal with this issue it would make sense for Pig to provide a standard mechanism for it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1477) Syntax error in tutorial Pig Script 1: Query Phrase Popularity (ORDER operator)
[ https://issues.apache.org/jira/browse/PIG-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1477: Assignee: Corinne Chandel Syntax error in tutorial Pig Script 1: Query Phrase Popularity (ORDER operator) --- Key: PIG-1477 URL: https://issues.apache.org/jira/browse/PIG-1477 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Brian Mansell Assignee: Corinne Chandel Priority: Trivial Fix For: 0.8.0 Documentation syntax should reflect the correct code indicated in the tutorial script. Documentation syntax {code} ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score); {code} Above syntax results in this error: {code} 2010-06-30 22:12:16,412 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered "," at line 1, column 64. Was expecting: ")" ... {code} (Correct) Tutorial script syntax {code} ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1436) Print number of records outputted at each step of a Pig script
[ https://issues.apache.org/jira/browse/PIG-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1436. - Resolution: Duplicate This looks like duplicate of PIG-1478. Please, re-open if this is not the case Print number of records outputted at each step of a Pig script -- Key: PIG-1436 URL: https://issues.apache.org/jira/browse/PIG-1436 Project: Pig Issue Type: New Feature Components: grunt Affects Versions: 0.7.0 Reporter: Russell Jurney Assignee: Richard Ding Priority: Minor Fix For: 0.8.0 I often run a script multiple times, or have to go and look through Hadoop task logs, to figure out where I broke a long script in such a way that I get 0 records out of it. I think this is a common problem. If someone can point me in the right direction, I can make a pass at this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1465) Filter inside foreach is broken
[ https://issues.apache.org/jira/browse/PIG-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1465: Fix Version/s: 0.8.0 Filter inside foreach is broken --- Key: PIG-1465 URL: https://issues.apache.org/jira/browse/PIG-1465 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: hc busy Fix For: 0.8.0 {quote} % cat data.txt x,a,1,a x,a,2,a x,a,3,b x,a,4,b y,a,1,a y,a,2,a y,a,3,b y,a,4,b % cat script.pig a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray); b = group a by ind; describe b; f = foreach b { all_total = SUM(a.num); fed = filter a by (f1==f2); some_total = (int)SUM(fed.num); generate group as ind, all_total, some_total; } describe f; dump f; % pig -f script.pig (x,a,1,a,,) (x,a,2,a,,) (x,a,3,b,,) (x,a,4,b,,) (y,a,1,a,,) (y,a,2,a,,) (y,a,3,b,,) (y,a,4,b,,) % cat what_I_expected (x,10,3) (y,10,3) {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
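For reference, the semantics the reporter expects from the nested foreach — a per-group total alongside a filtered per-group total — can be illustrated outside Pig. This is a plain-Python sketch of what `dump f` should produce, not Pig code; names like `expected` are ours:

```python
from itertools import groupby

# The data.txt rows from the report: (ind, f1, num, f2)
rows = [("x", "a", 1, "a"), ("x", "a", 2, "a"), ("x", "a", 3, "b"), ("x", "a", 4, "b"),
        ("y", "a", 1, "a"), ("y", "a", 2, "a"), ("y", "a", 3, "b"), ("y", "a", 4, "b")]

def expected(rows):
    """Mimic: group by ind; all_total = SUM(num); some_total = SUM(num) where f1 == f2."""
    out = []
    for ind, grp in groupby(sorted(rows), key=lambda r: r[0]):
        grp = list(grp)
        all_total = sum(r[2] for r in grp)                      # SUM(a.num)
        some_total = sum(r[2] for r in grp if r[1] == r[3])     # SUM(fed.num) after the inner filter
        out.append((ind, all_total, some_total))
    return out
```

Running `expected(rows)` yields the `(x,10,3) (y,10,3)` the reporter lists under `what_I_expected`, which makes the divergence of the actual dump clear.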
[jira] Resolved: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
[ https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1470. - Resolution: Won't Fix Closing since there is no fix in Pig required. Feel gree to continue the discussion on the mailing lists. map/red jobs fail using G1 GC (Couldn't find heap) -- Key: PIG-1470 URL: https://issues.apache.org/jira/browse/PIG-1470 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07) Hadoop: 0.20.1 Reporter: Randy Prager Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails {noformat} property namemapred.child.java.opts/name value-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC/value /property {noformat} Here is the hadoop map/red configuration that succeeds {noformat} property namemapred.child.java.opts/name value-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops/value /property {noformat} Here is the exception from the pig script. {noformat} Backend error message - org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to set up the load function. 
at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[,]' at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519) at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85) ... 5 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487) ... 6 more Caused by: java.lang.RuntimeException: Couldn't find heap at org.apache.pig.impl.util.SpillableMemoryManager.init(SpillableMemoryManager.java:95) at org.apache.pig.data.BagFactory.init(BagFactory.java:106) at org.apache.pig.data.DefaultBagFactory.init(DefaultBagFactory.java:71) at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76) at org.apache.pig.builtin.Utf8StorageConverter.init(Utf8StorageConverter.java:49) at org.apache.pig.builtin.PigStorage.init(PigStorage.java:69) at org.apache.pig.builtin.PigStorage.init(PigStorage.java:79) ... 11 more {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1492) DefaultTuple and DefaultMemory underestimate their memory footprint
[ https://issues.apache.org/jira/browse/PIG-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1492: Assignee: Thejas M Nair Fix Version/s: 0.8.0 DefaultTuple and DefaultMemory underestimate their memory footprint -- Key: PIG-1492 URL: https://issues.apache.org/jira/browse/PIG-1492 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 There are several places where we highly underestimate the memory footprint. For example, for map datatypes, we don't account for the per-entry cost of the map container data structures. The estimated size of a tuple having a map with 100 integer key-value entries, as per the current version of the code, is 3260 bytes, while what is observed is around 6775 bytes. To verify the memory footprint, I checked free memory before and after creating multiple instances of the object, using code on the lines of http://www.javaspecialists.eu/archive/Issue029.html . In PIG-1443 a similar change was made to fix this for CHARARRAY. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
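The measurement technique described above — check memory before and after allocating many instances and divide by the count — has a close analogue in other runtimes. Here is a hedged Python sketch of the same idea (using `tracemalloc` rather than the Java free-memory trick from the javaspecialists article; the function name and the 100-entry dict stand-in are ours):

```python
import tracemalloc

def measured_size_per_instance(factory, n=10000):
    """Estimate real per-instance memory cost: allocate n objects while
    tracing allocations, then charge the traced delta evenly to them."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    objs = [factory() for _ in range(n)]
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    objs.clear()
    return (after - before) / n

# A 100-entry int->int dict plays the role of the tuple-with-map case:
per_map = measured_size_per_instance(lambda: {i: i for i in range(100)}, 200)
```

Comparing such a measured figure against a hand-computed estimate is exactly how the 3260-vs-6775-byte gap above would surface.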
[jira] Assigned: (PIG-523) help in grunt should show all commands
[ https://issues.apache.org/jira/browse/PIG-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-523: -- Assignee: Olga Natkovich help in grunt should show all commands -- Key: PIG-523 URL: https://issues.apache.org/jira/browse/PIG-523 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Priority: Minor Fix For: 0.8.0 Currently, it only shows commands directly supported by the grunt parser and not commands supported by the pig parser. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-347) Pig (help) Commands
[ https://issues.apache.org/jira/browse/PIG-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-347: -- Assignee: Olga Natkovich Pig (help) Commands --- Key: PIG-347 URL: https://issues.apache.org/jira/browse/PIG-347 Project: Pig Issue Type: Bug Reporter: Corinne Chandel Assignee: Olga Natkovich Priority: Minor Fix For: 0.8.0 Pig help can be specified 2 ways: $pig -help and $pig -h I. $pig -help (seen by external/internal users) (1) fix -c, -cluster clustername, kryptonite is default remove kryptonite is default (2) change -x, -exectype local|mapreduce, mapreduce is default change mapdreduce to hadoop (maintain backward compatibility) II. $pig -h (seen by internal users users only) (1) fix typos -l, --latest use latest, untested, unsupported version of pig.jar instaed of relased, tested, supported version. instead of released (2) fix -c, -cluster clustername, kryptonite is default remove kryptonite is default (same as above) (3) change: -x, -exectype local|mapreduce, mapreduce is default ... change mapdreduce to hadoop (maintain backward compatibility) (same as above) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887538#action_12887538 ] Olga Natkovich commented on PIG-1494: - Swati, I am assigning it to you since I am assuming you plan to work on it for 0.8. Otherwise, it is unlikely to happen in 0.8 timeframe. Feel free to unassign and unlink from this release if this is not the case PIG Logical Optimization: Use CNF in PushUpFilter - Key: PIG-1494 URL: https://issues.apache.org/jira/browse/PIG-1494 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Swati Jain Priority: Minor Fix For: 0.8.0 The PushUpFilter rule is not able to handle complicated boolean expressions. For example, SplitFilter rule is splitting one LOFilter into two by AND. However it will not be able to split LOFilter if the top level operator is OR. For example: *ex script:* A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); J1 = JOIN B by b1, C by c1; J2 = JOIN J1 by $0, A by a1; D = *Filter J2 by ( (c1 10) AND (a3+b3 10) ) OR (c2 == 5);* explain D; In the above example, the PushUpFilter is not able to push any filter condition across any join as it contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF) then we would be able to push filter condition c1 10 and c2 == 5 below both join conditions. Here is the CNF expression for highlighted line: ( (c1 10) OR (c2 == 5) ) AND ( (a3+b3 10) OR (c2 ==5) ) *Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF, it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also not require rule SplitFilter anymore if we were to add this utility to rule PushUpFilter itself. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
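The conversion the suggestion above describes — distributing OR over AND so that each resulting conjunct can be pushed independently — can be sketched in a few lines. This is a minimal illustration with opaque string literals standing in for predicates like `c2 == 5`; it is not Pig's optimizer code, and it does no threshold limiting:

```python
def cnf(expr):
    """expr: ('and', l, r) | ('or', l, r) | literal (str).
    Returns CNF as a list of clauses; each clause is a frozenset of literals."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, left, right = expr
    lc, rc = cnf(left), cnf(right)
    if op == "and":
        # AND of CNFs is just clause concatenation
        return lc + rc
    # OR: distribute over both clause lists -> (A1^A2) v (B1^B2) = ^(Ai v Bj)
    return [cl | cr for cl in lc for cr in rc]

# (P AND Q) OR R  ->  (P OR R) AND (Q OR R), mirroring the example in the issue
clauses = cnf(("or", ("and", "P", "Q"), "R"))
```

With `P = c1-cond`, `Q = a3+b3-cond`, `R = c2 == 5`, the two clauses produced are exactly the two disjuncts the issue shows, each of which PushUpFilter could then consider separately.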
[jira] Assigned: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter
[ https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1494: --- Assignee: Swati Jain PIG Logical Optimization: Use CNF in PushUpFilter - Key: PIG-1494 URL: https://issues.apache.org/jira/browse/PIG-1494 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Swati Jain Assignee: Swati Jain Priority: Minor Fix For: 0.8.0 The PushUpFilter rule is not able to handle complicated boolean expressions. For example, SplitFilter rule is splitting one LOFilter into two by AND. However it will not be able to split LOFilter if the top level operator is OR. For example: *ex script:* A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); J1 = JOIN B by b1, C by c1; J2 = JOIN J1 by $0, A by a1; D = *Filter J2 by ( (c1 10) AND (a3+b3 10) ) OR (c2 == 5);* explain D; In the above example, the PushUpFilter is not able to push any filter condition across any join as it contains columns from all branches (inputs). But if we convert this expression into Conjunctive Normal Form (CNF) then we would be able to push filter condition c1 10 and c2 == 5 below both join conditions. Here is the CNF expression for highlighted line: ( (c1 10) OR (c2 == 5) ) AND ( (a3+b3 10) OR (c2 ==5) ) *Suggestion:* It would be a good idea to convert LOFilter's boolean expression into CNF, it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. We would also not require rule SplitFilter anymore if we were to add this utility to rule PushUpFilter itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887545#action_12887545 ] Daniel Dai commented on PIG-1472: - +1 for commit. Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, PIG-1472.patch In certain types of pig queries most of the execution time is spent in serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests - 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than 1/2. Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
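Optimization 1 above — spend one length byte when the column is short, falling back to a wider length only when needed — can be sketched as follows. This is a hypothetical encoding for illustration, not Pig's actual wire format:

```python
import struct

def encode_bytes(b: bytes) -> bytes:
    """One length byte when len < 255; otherwise an 0xFF marker + 4-byte big-endian length."""
    if len(b) < 255:
        return bytes([len(b)]) + b
    return b"\xff" + struct.pack(">I", len(b)) + b

def decode_bytes(buf: bytes, pos: int = 0):
    """Inverse of encode_bytes; returns (payload, next position)."""
    n = buf[pos]
    pos += 1
    if n == 255:  # wide-length escape
        n = struct.unpack_from(">I", buf, pos)[0]
        pos += 4
    return buf[pos:pos + n], pos + n
```

For the common short-bytearray case this saves 3 of the 4 length bytes per column, which is where the sedes win described above comes from.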
[jira] Commented: (PIG-1430) ISODateTime - DateTime: DateTime UDFs Should Also Support int/second Unix Times in All Operations
[ https://issues.apache.org/jira/browse/PIG-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887546#action_12887546 ] Alan Gates commented on PIG-1430: - I think it's fine to start with just putting conversion functions into Pig Latin. What I'd like to clarify though is what is the desired end state? Does Pig eventually have a datetime type that does all the datetime stuff you can dream of (timezones, etc.)? Or does Pig only ever have longs or strings to represent times and a set of functions to work with those? Are you proposing that latter, or delaying the former in interest of getting something into 0.8? ISODateTime - DateTime: DateTime UDFs Should Also Support int/second Unix Times in All Operations -- Key: PIG-1430 URL: https://issues.apache.org/jira/browse/PIG-1430 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Russell Jurney Fix For: 0.8.0 All functions in contrib.piggybank.java.src.main.java.org.apache.pig.piggybank.evaluation.datetime should seamlessly accept integer Unix/POSIX times, and return Unix time output when given an int, and ISO output when given a chararray. Note: Unix/POSIX times are the number of seconds elapsed since midnight proleptic Coordinated Universal Time (UTC) of January 1, 1970, not counting leap seconds. See http://en.wikipedia.org/wiki/Unix_time -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
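The polymorphic behavior being requested — int input yields Unix-time output, chararray input yields ISO output — might look like this in outline. A Python sketch only; the `ISO` format string and function name are illustrative, not the piggybank API:

```python
from datetime import datetime, timezone

ISO = "%Y-%m-%dT%H:%M:%SZ"  # assumed ISO-8601 UTC shape for the sketch

def add_seconds(t, delta):
    """Unix-time int in -> int out; ISO chararray in -> ISO chararray out."""
    if isinstance(t, int):
        return t + delta
    dt = datetime.strptime(t, ISO).replace(tzinfo=timezone.utc)
    return datetime.fromtimestamp(dt.timestamp() + delta, tz=timezone.utc).strftime(ISO)
```

Dispatching on the runtime type at the top of each UDF is the "seamless" part the issue asks for; everything downstream is shared.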
[jira] Created: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line
Add -q command line option to set queue name for Pig jobs from command line --- Key: PIG-1495 URL: https://issues.apache.org/jira/browse/PIG-1495 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.7.0 Reporter: Russell Jurney Fix For: 0.8.0 rjurney$ pig -q default This sets the mapred.job.queue.name property in the execution engine from the pig properties for MAPRED type jobs. Patch attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
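The mapping the patch describes — a `-q` flag whose value lands in the `mapred.job.queue.name` property — is essentially the following. A hedged sketch of the option handling, not the attached patch itself:

```python
import argparse

def queue_property(argv):
    """Translate a pig-style -q/--queue flag into the job property it should set."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-q", "--queue", default=None)
    args, _ = parser.parse_known_args(argv)  # ignore the rest of pig's options
    props = {}
    if args.queue:
        props["mapred.job.queue.name"] = args.queue
    return props
```

So `pig -q default` would contribute `{"mapred.job.queue.name": "default"}` to the job configuration handed to the execution engine.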
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1321: Fix Version/s: 0.8.0 Logical Optimizer: Merge cascading foreach -- Key: PIG-1321 URL: https://issues.apache.org/jira/browse/PIG-1321 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 We can merge consecutive foreach statement. Eg: b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1; c = foreach b generate b0#'kk1', b0#'kk2', b1, a1; = c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
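The rewrite in the example above — substituting `b`'s column definitions into `c` so the map-key lookups chain — amounts to composing projections. A minimal sketch; the data-structure names are illustrative, not the optimizer's:

```python
def compose(outer, inner_defs):
    """outer: list of (base_field, key_chain) projections over the inner foreach's output.
    inner_defs: dict mapping an inner output field -> (base_field, key_chain).
    Returns outer rewritten directly over the inner foreach's input."""
    merged = []
    for base, keys in outer:
        if base in inner_defs:
            inner_base, inner_keys = inner_defs[base]
            merged.append((inner_base, inner_keys + keys))  # chain the key lookups
        else:
            merged.append((base, keys))  # field passed through untouched, e.g. a1
    return merged
```

Applied to the example, `b0#'kk1'` becomes `a0#'key1'#'kk1'`, and the two foreach statements collapse into one.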
RE: PIG Logical Optimization: Use CNF in SplitFilter
Hopefully by this week. I'm still in the debugging phase of the work. While you are welcome to reuse some of my algorithms, I doubt you can reuse the code as much as you want. It's basically for my DNF use. You might need to factor out some general codes which you can find reusable. I fully understand the I/O benefits as I put in my first message. And it is classified as Scenario 1. There is no doubt that it should be considered as part of your work. However, for this, CNF is not necessary. For scenario 2, the benefits will be from less in-core logical expression evaluation costs and no I/O benefits as I can see. And use of CNF may or may not lead to cheaper evaluations as the example in my first message shows. In other words, after use of CNF, you should compare the eval cost with that in the original expression eval before deciding either the CNF or the original form should be evaluated. Please let me know if I miss any of your points. Thanks, Yan From: Swati Jain [mailto:swat...@aggiemail.usu.edu] Sent: Monday, July 12, 2010 11:52 AM To: Yan Zhou Cc: pig-dev@hadoop.apache.org Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter I was wondering if you are not going to check in your patch soon then it would be great if you could share it with me. I believe I might be able to reuse some of your (utility) functionality directly or get some ideas. About your cost-benefit question: 1) I will control the complexity of CNF conversion by providing a configurable threshold value which will limit the OR-nesting. 2) One benefit of this conversion is that it will allow pushing parts of a filter (conjuncts) across the joins which is not happening in the current PushUpFilter optimization. Moreover, it may result in a cascading effect to push the conjuncts below other operators by other rules that may be fired as a result. 
The benefit from this is really data dependent, but in big-data workloads, any kind of predicate pushdown may eventually lead to big savings in amount of data read or amount of data transferred/shuffled across the network (I need to understand the LogicalPlan to PhysicalPlan conversion better to give concrete examples). Thanks! Swati On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote: Yes, I already implemented the NOT push down upfront, so you do not need to do that. The support of CNF will probably be the most difficult part. But as I mentioned last time, you should compare the cost after trimming the CNF to get the post-split filtering logic. Given the complexity of manipulating CNF and undetermined benefits, I am not sure it should be in scope at this moment or not. To handle CNF, I think it's a good idea to create a new plan and connect the nodes in the new plan to the base plan as you envisioned. In my changes, which use DNF instead of CNF but processing is similar otherwise, I use a LogicalExpressionProxy, which contains a source member that is just the node in the original plan, to link the nodes in the new plan and old plan. The original LogicalExpression is enhanced with a counter to trace the # of proxies of the original nodes since normal form creation will spread the nodes in the original tree across many normalized nodes. The benefit, aside from not setting the plan, is that the original expression is trimmed according to the processing results from DNF; while DNF is created separately and as a kinda utility so that complex features can be used. In my changes, I used a multiple-child tree in DNF while not changing the original binary expression tree structure. Another benefit is that the original tree is kept as much as it is at the start, i.e., I do not attempt to optimize its overall structure beyond trimming based upon the simplification logics. (I also control the size of DNF to 100 nodes.)
The down side of this is added complexity. But in your case, for scenario 2 which is the whole point to use CNF, you would need to change the original expression tree structurally beyond trimming for post-split filtering logic. The other benefit of using multiple-child expression is depending upon if you plan to support such expression to replace current binary tree in the final plan. Even though I think it's a good idea to support that, but it is not in my scope now. I'll add my algorithm details soon to my jira. Please take a look and comment as you see appropriate. Thanks, Yan From: Swati Jain [mailto:swat...@aggiemail.usu.edu] Sent: Friday, July 09, 2010 11:00 PM To: Yan Zhou Cc: pig-dev@hadoop.apache.org Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter Hi Yan, I agree that the first scenario (filter logic applied to individual input sources) doesn't need conversion to CNF and that it will be a good idea to add CNF functionality for the second scenario. I was also planning to provide a configurable threshold value to
[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1472: --- Status: Resolved (was: Patch Available) Resolution: Fixed Patch committed to trunk. Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, PIG-1472.patch In certain types of pig queries most of the execution time is spent in serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests - 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than 1/2. Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line
[ https://issues.apache.org/jira/browse/PIG-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Jurney updated PIG-1495: Status: Patch Available (was: Open) Add -q command line option to set queue name for Pig jobs from command line --- Key: PIG-1495 URL: https://issues.apache.org/jira/browse/PIG-1495 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.7.0 Reporter: Russell Jurney Fix For: 0.8.0 Attachments: set_queue.patch rjurney$ pig -q default This sets the mapred.job.queue.name property in the execution engine from the pig properties for MAPRED type jobs. Patch attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1368) Utf8StorageConverter's bytesToTuple and bytesToBag methods need to be tightened for corner cases
[ https://issues.apache.org/jira/browse/PIG-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1368. - Resolution: Duplicate This will be addressed as part of PIG-1271. Utf8StorageConverter's bytesToTuple and bytesToBag methods need to be tightened for corner cases Key: PIG-1368 URL: https://issues.apache.org/jira/browse/PIG-1368 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Pradeep Kamath Consider the following data: 1\t ( hello , bye ) \n 1\t( hello , bye )a\n 2 \t (good , bye)\n The following script gives the results below: a = load 'junk' as (i:int, t:tuple(s:chararray, r:chararray)); dump a; (1,( hello , bye )) (1,( hello , bye )) (2,(good , bye)) The current bytesToTuple implementation discards leading and trailing characters before the tuple delimiters and parses the tuple out - I think instead it should treat any leading and trailing characters (including space) near the delimiters as an indication of a malformed tuple and return null. Also in the code, consumeBag() should handle the special case of {} and not delegate the handling to consumeTuple(). In consumeBag() null tuples should not be skipped. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
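The proposed tightening — any character outside the parentheses, even a space, marks the tuple malformed and yields null — can be illustrated with a toy parser. A Python sketch only; unlike the real converter it handles no nesting, quoting, or type casting:

```python
def bytes_to_tuple(s: str):
    """Strict parse: the field must be exactly '(...)'; anything before '('
    or after ')' (including whitespace) makes it malformed -> None (null)."""
    if len(s) < 2 or s[0] != "(" or s[-1] != ")":
        return None
    return tuple(s[1:-1].split(","))
```

Under this rule the first and second rows of the sample data (` ( hello , bye ) ` and `( hello , bye )a`) would load as null tuples instead of being silently "repaired".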
[jira] Assigned: (PIG-1466) Improve log messages for memory usage
[ https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1466: --- Assignee: Thejas M Nair Thejas, can you update the messages since you are already looking at the memory stuff, thanks. Improve log messages for memory usage - Key: PIG-1466 URL: https://issues.apache.org/jira/browse/PIG-1466 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Thejas M Nair Priority: Minor Fix For: 0.8.0 For anything more than a moderately sized dataset, Pig usually spits out the following messages: {code} 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K) 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K) {code} This seems to confuse users a lot. Once these messages are printed, users tend to believe that Pig is having a hard time with memory, is spilling to disk, etc., but in fact Pig might be cruising along at ease. We should be a little more careful about what we print in logs. Currently these are printed when a notification is sent by the JVM and some other conditions are met, which may not necessarily indicate a low memory condition. Furthermore, with {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these messages have lost their usefulness. At the very least, we should lower the log level at which these are printed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line
[ https://issues.apache.org/jira/browse/PIG-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Jurney updated PIG-1495: Status: Open (was: Patch Available) Add -q command line option to set queue name for Pig jobs from command line --- Key: PIG-1495 URL: https://issues.apache.org/jira/browse/PIG-1495 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.7.0 Reporter: Russell Jurney Fix For: 0.8.0 Attachments: set_queue.patch rjurney$ pig -q default This sets the mapred.job.queue.name property in the execution engine from the pig properties for MAPRED type jobs. Patch attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
PIG Logical Optimization: Use CNF in SplitFilter
Yan, What I meant in my last email was that scenario 2 optimizations would lead to more opportunities for scenario 1 kind of optimizations. Consider the conjunct list [C1;C2;C3] as the source of a JOIN. (a) Suppose none of these are computable on a join input, in this case we retain the original expression and discard the CNF. (b) Suppose C1 is computable on join input J1 and C2 is computable on join input J2 but C3 requires a combination of both join inputs. In this case, we push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note that C1 and C2 may be further pushed up (with additional iterations of the optimizer). If they are now the source of single input operators, it is similar to scenario 1. Thanks, Swati On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou y...@yahoo-inc.com wrote: Hopefully by this week. I’m still in the debugging phase of the work. While you are welcome to reuse some of my algorithms, I doubt you can reuse the code as much as you want. It’s basically for my DNF use. You might need to factor out some general codes which you can find reusable. I fully understand the I/O benefits as I put in my first message. And it is classified as “Scenario 1”. There is no doubt that it should be considered as part of your work. However, for this, CNF is not necessary. For scenario 2, the benefits will be from less in-core logical expression evaluation costs and no I/O benefits as I can see. And use of CNF may or may not lead to cheaper evaluations as the example in my first message shows. In other words, after use of CNF, you should compare the eval cost with that in the original expression eval before deciding either the CNF or the original form should be evaluated. Please let me know if I miss any of your points. 
Thanks, Yan -- *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu] *Sent:* Monday, July 12, 2010 11:52 AM *To:* Yan Zhou *Cc:* pig-dev@hadoop.apache.org *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter I was wondering if you are not going to check in your patch soon then it would be great if you could share it with me. I believe I might be able to reuse some of your (utility) functionality directly or get some ideas. About your cost-benefit question: 1) I will control the complexity of CNF conversion by providing a configurable threshold value which will limit the OR-nesting. 2) One benefit of this conversion is that it will allow pushing parts of a filter (conjuncts) across the joins which is not happening in the current PushUpFilter optimization. Moreover, it may result in a cascading effect to push the conjuncts below other operators by other rules that may be fired as a result. The benefit from this is really data dependent, but in big-data workloads, any kind of predicate pushdown may eventually lead to big savings in amount of data read or amount of data transferred/shuffled across the network (I need to understand the LogicalPlan to PhysicalPlan conversion better to give concrete examples). Thanks! Swati On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote: Yes, I already implemented the “NOT push down” upfront, so you do not need to do that. The support of CNF will probably be the most difficult part. But as I mentioned last time, you should compare the cost after trimming the CNF to get the post-split filtering logic. Given the complexity of manipulating CNF and undetermined benefits, I am not sure it should be in scope at this moment or not. To handle CNF, I think it’s a good idea to create a new plan and connect the nodes in the new plan to the base plan as you envisioned.
In my changes, which use DNF instead of CNF but whose processing is otherwise similar, I use a LogicalExpressionProxy, which contains a “source” member that is just the node in the original plan, to link the nodes in the new plan and the old plan. The original LogicalExpression is enhanced with a counter to trace the # of proxies of the original nodes, since normal-form creation will “spread” the nodes in the original tree across many normalized nodes. The benefit, aside from not setting the plan, is that the original expression is trimmed according to the processing results from the DNF, while the DNF is created separately and as a kind of utility, so that complex features can be used. In my changes, I used a multiple-child tree in the DNF while not changing the original binary expression tree structure. Another benefit is that the original tree is kept as much as it is at the start, i.e., I do not attempt to optimize its overall structure beyond trimming based upon the simplification logic. (I also control the size of the DNF to 100 nodes.) The downside of this is added complexity. But in your case, for scenario 2, which is the whole point of using CNF, you would need to change the original expression tree structurally beyond trimming for post-split filtering.
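Swati's cases (a) and (b) above amount to partitioning a CNF's conjuncts by which join input can evaluate each one. A minimal sketch, assuming a CNF is already available as a list of conjuncts tagged with the input aliases they reference (class and predicate names here are illustrative, not Pig's actual optimizer classes):

```java
import java.util.*;

/** Hypothetical sketch: split the conjuncts of a filter above a JOIN into
 *  per-input pushable sets and a residual set that stays below the JOIN. */
public class ConjunctPusher {
    /** A conjunct together with the set of join-input aliases it references. */
    static final class Conjunct {
        final String text;
        final Set<String> aliases;
        Conjunct(String text, String... aliases) {
            this.text = text;
            this.aliases = new HashSet<>(Arrays.asList(aliases));
        }
    }

    /** Returns a map: input alias -> conjuncts pushable above that input;
     *  the special key "" holds the residual conjuncts left below the JOIN. */
    static Map<String, List<Conjunct>> partition(List<Conjunct> cnf) {
        Map<String, List<Conjunct>> result = new HashMap<>();
        for (Conjunct c : cnf) {
            // Pushable only if every column it touches comes from one input.
            String key = c.aliases.size() == 1 ? c.aliases.iterator().next() : "";
            result.computeIfAbsent(key, k -> new ArrayList<>()).add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Conjunct> cnf = Arrays.asList(
            new Conjunct("C1: c1 > 10", "J1"),
            new Conjunct("C2: b2 == 5", "J2"),
            new Conjunct("C3: a3 + b3 > 10", "J1", "J2"));  // needs both inputs
        Map<String, List<Conjunct>> parts = partition(cnf);
        System.out.println("push above J1: " + parts.get("J1").get(0).text);
        System.out.println("push above J2: " + parts.get("J2").get(0).text);
        System.out.println("stay below JOIN: " + parts.get("").get(0).text);
    }
}
```

As in case (b), C1 and C2 end up above their respective inputs and C3 stays below the JOIN; further optimizer iterations could then push C1 and C2 higher.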
[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API
[ https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887578#action_12887578 ] Alan Gates commented on PIG-1478: - I don't understand the difference between launchStartedNotification() and jobsSubmittedNotification(). When will outputCompletedNotification() be called? Only after the job is completely done? What, if any, guarantees are we making on the order of this relative to when PigRunner.run returns? It isn't clear to me that launchCompleteNotification() is useful. Once the launch has completed the user will start getting jobStartedNotification() calls. Add progress notification listener to PigRunner API --- Key: PIG-1478 URL: https://issues.apache.org/jira/browse/PIG-1478 Project: Pig Issue Type: Improvement Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1478.patch PIG-1333 added PigRunner API to allow Pig users and tools to get a status/stats object back after executing a Pig script. The new API, however, is synchronous (blocking). It's known that a Pig script can spawn tens (even hundreds) MR jobs and take hours to complete. Therefore it'll be nice to give progress feedback to the callers during the execution. 
The proposal is to add an optional parameter to the API:
{code}
public abstract class PigRunner {
    public static PigStats run(String[] args, PigProgressNotificationListener listener) {...}
}
{code}
The new listener is defined as follows:
{code}
package org.apache.pig.tools.pigstats;

public interface PigProgressNotificationListener extends java.util.EventListener {
    // just before the launch of MR jobs for the script
    public void launchStartedNotification(int numJobsToLaunch);
    // number of jobs submitted in a batch
    public void jobsSubmittedNotification(int numJobsSubmitted);
    // a job is started
    public void jobStartedNotification(String assignedJobId);
    // a job is completed successfully
    public void jobFinishedNotification(JobStats jobStats);
    // a job has failed
    public void jobFailedNotification(JobStats jobStats);
    // a user output is completed successfully
    public void outputCompletedNotification(OutputStats outputStats);
    // updates the progress as a percentage
    public void progressUpdatedNotification(int progress);
    // the script execution is done
    public void launchCompletedNotification(int numJobsSucceeded);
}
{code}
Any thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
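For context, a minimal sketch of how a caller might implement the proposed listener. JobStats and OutputStats are stubbed here as stand-ins for the real org.apache.pig.tools.pigstats classes, and ProgressTracker is hypothetical; it only tracks the bookkeeping implied by the proposal (jobs submitted across all batches should add up to numJobsToLaunch):

```java
import java.util.EventListener;

// Stand-ins for org.apache.pig.tools.pigstats.JobStats / OutputStats,
// stubbed minimally so this sketch compiles on its own.
class JobStats { final String jobId; JobStats(String jobId) { this.jobId = jobId; } }
class OutputStats { final String location; OutputStats(String loc) { this.location = loc; } }

// The proposed interface, restated from the patch above
// (method names normalized to lowerCamelCase).
interface PigProgressNotificationListener extends EventListener {
    void launchStartedNotification(int numJobsToLaunch);
    void jobsSubmittedNotification(int numJobsSubmitted);
    void jobStartedNotification(String assignedJobId);
    void jobFinishedNotification(JobStats jobStats);
    void jobFailedNotification(JobStats jobStats);
    void outputCompletedNotification(OutputStats outputStats);
    void progressUpdatedNotification(int progress);
    void launchCompletedNotification(int numJobsSucceeded);
}

/** Hypothetical listener: accumulates submitted batches and finished jobs. */
class ProgressTracker implements PigProgressNotificationListener {
    int toLaunch, submitted, finished;
    public void launchStartedNotification(int n)   { toLaunch = n; }
    public void jobsSubmittedNotification(int n)   { submitted += n; }
    public void jobStartedNotification(String id)  { }
    public void jobFinishedNotification(JobStats s){ finished++; }
    public void jobFailedNotification(JobStats s)  { }
    public void outputCompletedNotification(OutputStats o) { }
    public void progressUpdatedNotification(int p) { }
    public void launchCompletedNotification(int n) {
        System.out.println("launched=" + toLaunch + " submitted=" + submitted
                + " finished=" + finished + " succeeded=" + n);
    }

    public static void main(String[] args) {
        // Simulate the callback sequence for a two-job script,
        // submitted in two batches because of a job dependency.
        ProgressTracker t = new ProgressTracker();
        t.launchStartedNotification(2);
        t.jobsSubmittedNotification(1);
        t.jobStartedNotification("job_0001");
        t.jobFinishedNotification(new JobStats("job_0001"));
        t.jobsSubmittedNotification(1);
        t.jobStartedNotification("job_0002");
        t.jobFinishedNotification(new JobStats("job_0002"));
        t.launchCompletedNotification(2);
        // prints: launched=2 submitted=2 finished=2 succeeded=2
    }
}
```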
[jira] Assigned: (PIG-1460) UDF manual and javadocs should make clear how to use RequiredFieldList
[ https://issues.apache.org/jira/browse/PIG-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1460: --- Assignee: Pradeep Kamath Pradeep, could you provide the information needed and also update the javadoc. Then, please, re-assign to Corinne so that she can update the UDF manual, thanks. UDF manual and javadocs should make clear how to use RequiredFieldList -- Key: PIG-1460 URL: https://issues.apache.org/jira/browse/PIG-1460 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Pradeep Kamath Priority: Minor Fix For: 0.8.0 The UDF manual mentions that load function writers need to handle RequiredFieldList passed to LoadPushDown.pushProjection, but it does not specify how the writer should interpret the contents of that list. The javadoc is similarly vague. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1373) We need to add jdiff output to docs on the website
[ https://issues.apache.org/jira/browse/PIG-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887582#action_12887582 ] Daniel Dai commented on PIG-1373: - All the changes are made; we need to verify the API changes link when 0.8 is released. We need to add jdiff output to docs on the website -- Key: PIG-1373 URL: https://issues.apache.org/jira/browse/PIG-1373 Project: Pig Issue Type: Bug Reporter: Alan Gates Assignee: Daniel Dai Priority: Minor Fix For: 0.8.0 Attachments: PIG-1373-1.patch, PIG-1373-2.patch Our build process constructs a jdiff between APIs for different versions. But we don't post the results of that to the website when we deploy the docs. We should, in order to help users understand changes across versions of Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1496) Mandatory rule ImplicitSplitInserter
Mandatory rule ImplicitSplitInserter Key: PIG-1496 URL: https://issues.apache.org/jira/browse/PIG-1496 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 Need to migrate ImplicitSplitInserter to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1497) Mandatory rule PartitionFilterOptimizer
Mandatory rule PartitionFilterOptimizer --- Key: PIG-1497 URL: https://issues.apache.org/jira/browse/PIG-1497 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 Need to migrate PartitionFilterOptimizer to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line
[ https://issues.apache.org/jira/browse/PIG-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887585#action_12887585 ] Russell Jurney commented on PIG-1495: - This doesn't work yet. Doh! Add -q command line option to set queue name for Pig jobs from command line --- Key: PIG-1495 URL: https://issues.apache.org/jira/browse/PIG-1495 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.7.0 Reporter: Russell Jurney Fix For: 0.8.0 Attachments: set_queue.patch rjurney$ pig -q default This sets the mapred.job.queue.name property in the execution engine from the pig properties for MAPRED type jobs. Patch attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
RE: PIG Logical Optimization: Use CNF in SplitFilter
I see. There seems to be some disconnect about Scenario 1. To me, all filtering logic that can be pushed above JOIN can be figured out without use of CNF, which is Scenario 1; while CNF helps to derive the filtering logic after (or, in your example, below) JOIN, which is Scenario 2. In your example, C1 and C2, or their equivalent, above JOIN can be easily figured out without resorting to CNF; C3 may have to be figured out with CNF, but the evaluation cost of the post-Join filtering logic thus generated may not be cheaper than that of the original one before pushing up. In summary, if we want to support Scenario 2 (and 1), we should use CNF; if we JUST want to support Scenario 1, which will push up all possible filters closer to the source and have all the benefits of pruned I/O, we should not use CNF. Thanks, Yan
[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API
[ https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887602#action_12887602 ] Richard Ding commented on PIG-1478: --- bq. I don't understand the difference between launchStartedNotification() and jobsSubmittedNotification(). launchStartedNotification() tells the listeners the total number of jobs ready to submit for the script. jobsSubmittedNotification() tells the listeners the number of jobs submitted in a batch. Because of the dependencies between jobs, Pig may not be able to submit all the jobs together. So the numJobsToLaunch passed to launchStartedNotification() should equal the sum of the numJobsSubmitted of all jobsSubmittedNotification() calls. bq. When will outputCompletedNotification() be called? Only after the job is completely done? What, if any, guarantees are we making on the order of this relative to when PigRunner.run returns? outputCompletedNotification() is called after the job that writes this output is done. It is only called for user outputs. As a script can have multiple user outputs, some outputs may be written before all jobs are done. bq. It isn't clear to me that launchCompleteNotification() is useful. Once the launch has completed the user will start getting jobStartedNotification() calls. Just trying to be complete. launchCompletedNotification() is called when all jobs are done. If a script is executed successfully, the numJobsSucceeded should equal the numJobsToLaunch from launchStartedNotification().
An example log trace looks like this:
{code}
numJobsToLaunch: 3
jobs submitted: 1
progress: 0%
job started: job_20100702195434153_0002
progress: 16%
progress: 33%
job finished: job_20100702195434153_0002
jobs submitted: 1
job started: job_20100702195434153_0003
progress: 50%
progress: 66%
job finished: job_20100702195434153_0003
jobs submitted: 1
job started: job_20100702195434153_0004
progress: 83%
output done: hdfs://localhost.localdomain:52083/user/pig/myoutput
job finished: job_20100702195434153_0004
progress: 100%
numJobsSucceeded: 3
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: PIG Logical Optimization: Use CNF in SplitFilter
Hi Yan, Thanks for your prompt reply. I did not understand your statement “C1 and C2, or their equivalent, above JOIN can be easily figured out without resorting to CNF”. Consider an LOFilter above an LOJoin. The predicate of the LOFilter:
( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5)
The schema for the LOJoin:
A = (a1:int,a2:int,a3:int); B = (b1:int,b2:int,b3:int); C = (c1:int,c2:int,c3:int);
After CNF:
( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
Now we can push ( (c1 > 10) OR (c2 == 5) ) above the JOIN (in the branch leading up to the source C) while ( (a3+b3 > 10) OR (c2 == 5) ) stays put below the JOIN. Please let me know if there is a way of doing the above optimization without converting the original expression to CNF. Thanks, Swati
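The CNF conversion in Swati's example is one application of distributing OR over AND. A small self-contained sketch of that rewrite (the Expr/And/Or/Leaf classes are illustrative, not Pig's LogicalExpression hierarchy):

```java
/** Sketch: convert a predicate tree to CNF by recursively
 *  distributing OR over AND: (A AND B) OR C -> (A OR C) AND (B OR C). */
abstract class Expr { }
class Leaf extends Expr {
    final String text;
    Leaf(String text) { this.text = text; }
    public String toString() { return text; }
}
class And extends Expr {
    final Expr l, r;
    And(Expr l, Expr r) { this.l = l; this.r = r; }
    public String toString() { return "(" + l + " AND " + r + ")"; }
}
class Or extends Expr {
    final Expr l, r;
    Or(Expr l, Expr r) { this.l = l; this.r = r; }
    public String toString() { return "(" + l + " OR " + r + ")"; }
}

class Cnf {
    static Expr toCnf(Expr e) {
        if (e instanceof And) {
            And a = (And) e;
            return new And(toCnf(a.l), toCnf(a.r));
        }
        if (e instanceof Or) {
            Or o = (Or) e;
            Expr l = toCnf(o.l), r = toCnf(o.r);
            if (l instanceof And) {           // (A AND B) OR C
                And a = (And) l;
                return new And(toCnf(new Or(a.l, r)), toCnf(new Or(a.r, r)));
            }
            if (r instanceof And) {           // C OR (A AND B)
                And a = (And) r;
                return new And(toCnf(new Or(l, a.l)), toCnf(new Or(l, a.r)));
            }
            return new Or(l, r);
        }
        return e; // leaf
    }

    public static void main(String[] args) {
        // Swati's example: ((c1 > 10) AND (a3+b3 > 10)) OR (c2 == 5)
        Expr pred = new Or(new And(new Leaf("c1 > 10"), new Leaf("a3+b3 > 10")),
                           new Leaf("c2 == 5"));
        System.out.println(Cnf.toCnf(pred));
        // -> ((c1 > 10 OR c2 == 5) AND (a3+b3 > 10 OR c2 == 5))
    }
}
```

Note the distribution step can duplicate subtrees, which is why both threads cap the size of the normal form (Yan limits his DNF to 100 nodes; Swati proposes a configurable OR-nesting threshold).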
RE: PIG Logical Optimization: Use CNF in SplitFilter
In the original expression, let (a3+b3 > 10) be true; the expression then transforms to ( (c1 > 10) OR (c2 == 5) ), since TRUE OR anything is still TRUE, and TRUE AND anything is that anything. You can write a visitor to easily do this type of partial evaluation. (a3+b3 > 10) is chosen because it cannot be determined from alias 'C'. Thanks, Yan
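Yan's partial-evaluation trick can be sketched as a visitor that replaces every predicate not computable from the target alias with TRUE and then simplifies. Because the expression contains no NOT, replacing predicates with TRUE only weakens it, so the result is implied by the original filter and is safe to push above the JOIN (the original filter still runs below). All class and predicate names here are illustrative:

```java
import java.util.*;

// Illustrative predicate tree: atoms tagged with the aliases they reference.
abstract class Pred { }
class Atom extends Pred {
    final String text; final Set<String> aliases;
    Atom(String text, String... aliases) {
        this.text = text;
        this.aliases = new HashSet<>(Arrays.asList(aliases));
    }
    public String toString() { return text; }
}
class True extends Pred { public String toString() { return "TRUE"; } }
class BinOp extends Pred {
    final String op; final Pred l, r;
    BinOp(String op, Pred l, Pred r) { this.op = op; this.l = l; this.r = r; }
    public String toString() { return "(" + l + " " + op + " " + r + ")"; }
}

class PartialEval {
    /** Replace atoms not computable from `alias` with TRUE, then simplify
     *  with TRUE OR x = TRUE and TRUE AND x = x. */
    static Pred assumeTrueExcept(Pred p, String alias) {
        if (p instanceof Atom) {
            Atom a = (Atom) p;
            return a.aliases.equals(Collections.singleton(alias)) ? a : new True();
        }
        if (p instanceof BinOp) {
            BinOp b = (BinOp) p;
            Pred l = assumeTrueExcept(b.l, alias), r = assumeTrueExcept(b.r, alias);
            if (b.op.equals("AND")) {                       // TRUE AND x = x
                if (l instanceof True) return r;
                if (r instanceof True) return l;
            } else if (l instanceof True || r instanceof True) {
                return new True();                          // TRUE OR x = TRUE
            }
            return new BinOp(b.op, l, r);
        }
        return p;
    }

    public static void main(String[] args) {
        // ((c1 > 10) AND (a3+b3 > 10)) OR (c2 == 5), with a3+b3 spanning A and B
        Pred p = new BinOp("OR",
            new BinOp("AND", new Atom("c1 > 10", "C"), new Atom("a3+b3 > 10", "A", "B")),
            new Atom("c2 == 5", "C"));
        System.out.println(assumeTrueExcept(p, "C"));
        // -> (c1 > 10 OR c2 == 5), Yan's pushable filter for input C
    }
}
```

This reproduces Yan's result without building the full CNF, which is his point: Scenario 1 pushdown does not need CNF, while the residual post-JOIN filter of Scenario 2 does.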