[jira] Updated: (PIG-1293) pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
[ https://issues.apache.org/jira/browse/PIG-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated PIG-1293:

    Status: Open (was: Patch Available)

pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set

    Key: PIG-1293
    URL: https://issues.apache.org/jira/browse/PIG-1293
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.6.0
    Reporter: Allen Wittenauer
    Attachments: PIG-1293.txt

If PIG_HOME isn't set and pig is in the path, the pig wrapper script can't find its home. Setting PIG_HOME makes it hard to support multiple versions of pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
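
One way a wrapper can find its installation without PIG_HOME is to resolve its own path, following symlinks, and take the parent of its bin directory. The sketch below shows that idea in Python rather than in the wrapper's shell. The function name `resolve_home` and the `<home>/bin/pig` layout are assumptions made for illustration; this is not the content of PIG-1293.txt.

```python
import os
import tempfile


def resolve_home(script_path, env=None):
    """Return the install root for a wrapper script.

    An explicit PIG_HOME wins; otherwise the root is derived from the
    script's real location, assuming a <home>/bin/<script> layout.
    """
    env = os.environ if env is None else env
    explicit = env.get("PIG_HOME")
    if explicit:
        return explicit
    real = os.path.realpath(script_path)  # follow symlinks found on the PATH
    return os.path.dirname(os.path.dirname(real))


# Demo: a fake install tree with the wrapper at <home>/bin/pig.
demo_root = os.path.realpath(tempfile.mkdtemp())
demo_home = os.path.join(demo_root, "pig-0.6.0")
os.makedirs(os.path.join(demo_home, "bin"))
demo_script = os.path.join(demo_home, "bin", "pig")
open(demo_script, "w").close()
derived_home = resolve_home(demo_script, env={})
```

Because the home is derived per script, several versions can sit side by side and each wrapper finds its own tree.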
[jira] Commented: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844611#action_12844611 ]

Dmitriy V. Ryaboy commented on PIG-1292:

Agreed with Xuefu's comment regarding the interfaces. This really seems like something where we can just have the abstract func default to false.

Method name suggestion: how about hasKeyToSplitAffinity()?

Interface Refinements

    Key: PIG-1292
    URL: https://issues.apache.org/jira/browse/PIG-1292
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: Ashutosh Chauhan
    Fix For: 0.7.0
    Attachments: pig-interfaces.patch

A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both are abstract classes instead of being interfaces.
[jira] Updated: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty
[ https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1290:

    Status: Open (was: Patch Available)

WeightedRangePartitioner should not check if input is empty if quantile file is empty

    Key: PIG-1290
    URL: https://issues.apache.org/jira/browse/PIG-1290
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.6.0, 0.7.0
    Reporter: Pradeep Kamath
    Assignee: Pradeep Kamath
    Fix For: 0.7.0
    Attachments: PIG-1290.patch

Currently WeightedRangePartitioner checks if the input is also empty if the quantile file is empty. For this it tries to read the input (which under the covers will result in creating splits for the input etc). If the input is a directory with many files, this could result in many calls to the namenode from each task - this can be avoided. If the input is non-empty and the quantile file is empty, then we would error out anyway (this should be confirmed). Also, while fixing this jira we should ensure that pig can still do order by on empty input.
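
The behavior requested above can be sketched with a simplified range partitioner that never touches the input when the quantile list is empty. This is a Python illustration, not Pig's WeightedRangePartitioner: the class name `RangePartitioner`, the unweighted split points, and the fallback of sending every key to partition 0 are all assumptions.

```python
import bisect


class RangePartitioner:
    """Simplified range partitioner (hypothetical, not Pig's class)."""

    def __init__(self, quantiles):
        # Sorted split points read from the quantile file; may be empty.
        self.quantiles = sorted(quantiles)

    def get_partition(self, key, num_partitions):
        if not self.quantiles:
            # Empty quantile file: do not inspect the input at all.
            # With no split points, every key maps to partition 0.
            return 0
        idx = bisect.bisect_right(self.quantiles, key)
        return min(idx, num_partitions - 1)


empty = RangePartitioner([])
ranged = RangePartitioner([10, 20])
```

The point of the sketch is only that the empty-quantile branch returns without any file system call, so no task has to ask the namenode about the input.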
[jira] Updated: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty
[ https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1290:

    Status: Patch Available (was: Open)

Looks like the unit test failure was due to some other check-in, which has now been fixed - resubmitting.

WeightedRangePartitioner should not check if input is empty if quantile file is empty

    Key: PIG-1290
    URL: https://issues.apache.org/jira/browse/PIG-1290
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.6.0, 0.7.0
    Reporter: Pradeep Kamath
    Assignee: Pradeep Kamath
    Fix For: 0.7.0
    Attachments: PIG-1290.patch
[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-506:

    Labels: mentor gsoc (was: )

Does pig need a NATIVE keyword?

    Key: PIG-506
    URL: https://issues.apache.org/jira/browse/PIG-506
    Project: Pig
    Issue Type: New Feature
    Components: impl
    Reporter: Alan Gates
    Assignee: Alan Gates
    Priority: Minor

Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs.

What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:

{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig outfile=topig);
E = join D by $0, X by $0;
...
{code}

This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, and the native block invoked, then data read back from disk.

Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not be worried about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer.
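
The store, invoke, read-back sequence described for NATIVE can be sketched in a few lines. The Python below is only an illustration of that control flow under stated assumptions: `run_native` and `double_second_field` are hypothetical names, the "native job" is a plain callable standing in for one or more map reduce jobs, and the JSON-lines files stand in for whatever storage format pig would use.

```python
import json
import os
import tempfile


def run_native(records, native_job):
    """Store records, hand the files to native_job, then load its output.

    native_job(infile, outfile) stands in for one or more self-contained
    map reduce jobs; it is a hypothetical callable, not a Pig API.
    """
    workdir = tempfile.mkdtemp()
    infile = os.path.join(workdir, "frompig")
    outfile = os.path.join(workdir, "topig")
    with open(infile, "w") as f:  # the pipeline is broken here: data hits disk
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    native_job(infile, outfile)  # the native block runs on the stored data
    with open(outfile) as f:  # results are read back so the script can continue
        return [json.loads(line) for line in f]


def double_second_field(infile, outfile):
    # Toy "native" job: doubles the second field of every record.
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            key, value = json.loads(line)
            dst.write(json.dumps([key, value * 2]) + "\n")


result = run_native([["a", 1], ["b", 2]], double_second_field)
```

Unlike streaming, nothing is piped between the two sides; the only contract is the pair of file locations.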
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844633#action_12844633 ]

Daniel Dai commented on PIG-366:

Marking this as a candidate project for the Google Summer of Code 2010 program.

Notes for GSOC 2010 applicants:
1. A good starting point for this project is the SIGMOD paper [Generating Example Data for Dataflow Programs|http://infolab.stanford.edu/~olston/publications/sigmod09.pdf]
2. The current code is outdated and no longer works. We need your help to bring this work up to date.

PigPen - Eclipse plugin for a graphical PigLatin editor

    Key: PIG-366
    URL: https://issues.apache.org/jira/browse/PIG-366
    Project: Pig
    Issue Type: New Feature
    Reporter: Shubham Chopra
    Assignee: Shubham Chopra
    Priority: Minor
    Attachments: org.apache.pig.pigpen_0.0.1.jar, org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, pigpen.patch, pigPen.patch, PigPen.tgz

This is an Eclipse plugin that provides a GUI that can help users create PigLatin scripts, see the example generator outputs on the fly, and submit the jobs to hadoop clusters.
[jira] Commented: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844640#action_12844640 ]

Dmitriy V. Ryaboy commented on PIG-1292:

.. but we have an abstract class that can provide default implementations so that implementers don't have to think about this. Most of the interfaces introduced in PIG-966 have significant chunks of functionality associated with them. This is just a single method about a particular property of the incoming data.

I can see why you'd be against putting it into LoadFunc, though, as it's very specific. What about ResourceSchema or LoadMetaData?

Interface Refinements

    Key: PIG-1292
    URL: https://issues.apache.org/jira/browse/PIG-1292
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: Ashutosh Chauhan
    Fix For: 0.7.0
    Attachments: pig-interfaces.patch
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1292:

    Attachment: pig-1292.patch

I didn't follow the suggestion about LoadMetaData and ResourceSchema. LoadMetaData is one of those interfaces which loaders can choose to implement. ResourceSchema is an independent class of its own.

New patch incorporating the changes suggested in the above comments. This patch also adds checks in the MRCompiler to enforce that the loader implements the new CollectableLoader interface if there is a map-side grouping ( PIG-984 ) in the script.

Interface Refinements

    Key: PIG-1292
    URL: https://issues.apache.org/jira/browse/PIG-1292
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: Ashutosh Chauhan
    Fix For: 0.7.0
    Attachments: pig-1292.patch, pig-interfaces.patch
[jira] Updated: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1292:

    Status: Patch Available (was: Open)

Hudson has been fickle recently. Hopefully, this patch gets lucky and is tested correctly.

Interface Refinements

    Key: PIG-1292
    URL: https://issues.apache.org/jira/browse/PIG-1292
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: Ashutosh Chauhan
    Fix For: 0.7.0
    Attachments: pig-1292.patch, pig-interfaces.patch
Operating on Cogroups and Iterations in Pig Re: more bagging fun
Hmm, okay, I read the documentation further and it appears that this has already been discussed previously (here: http://wiki.apache.org/pig/PigTypesFunctionalSpec). There seems to be a question of what the right thing to do is. It seems clear to me, though. When an operation like '*' is applied, this is clearly an item-wise operation that is to be applied to each member of the bag. If a function is an aggregate (SUM), then it operates across an entire bag.

When a COGROUP occurs, just do what SQL does. Which is to say, perform a cross join if an aggregate has been applied across several bags. And do so automatically, so we don't have to type out the separate FLATTEN's:

    grouped = COGROUP employee BY name, bonuses BY name;
    flattened = FOREACH grouped GENERATE group, FLATTEN(employee), FLATTEN(bonuses);
    grouped_again = GROUP flattened BY group;
    total_compensation = FOREACH grouped_again GENERATE group, SUM(employee::salary * bonuses::multiplier);

So this should do the same:

    grouped = COGROUP employee BY name, bonuses BY name;
    total_compensation = FOREACH grouped GENERATE group, SUM(employee::salary * bonuses::multiplier);

automatically, because that can only have one meaning.

Alternatively, if it is desired to stay with a low-level language, all of this confusion around UDF's that take bags and UDF's that operate on members of bags can be resolved if we do two things:

1.) Allow UDF's to actually become first class citizens. This way we can pass UDF's to other UDF's.
2.) Introduce the concept of map() and reduce() operators over bags.

These two things allow us more freedom and follow the paradigm of map-reducing more closely.

    grouped = COGROUP employee BY name, bonuses BY name;
    total_compensation = FOREACH grouped GENERATE group, reduce(SUM,map(*,employee::salary,bonuses::multiplier));

Actually, this may deserve a separate keyword. Because map and reduce operate on single bags, whereas Pig introduces this concept of co-grouping, we should have *comap* and *coreduce* that take functions and operate on multiple bags that are results of a *cogroup*.

    grouped = COGROUP employee BY name, bonuses BY name;
    total_compensation = FOREACH grouped GENERATE group, REDUCE(SUM,COMAP(*, employee::salary,bonuses::multiplier));

This allows us to write efficiently, on one line, what would otherwise be several aliases and unnecessary FLATTENed cross products.

A second thing that I see is the recommendation of implementing looping constructs. I wonder if I may suggest, as a follow up to the above, that we beef up UDF's as first class citizens and add the ability to create UDF functions in Pig Latin with the ability to recurse. The reason why I think this is a better way to loop than *for(;;)*, *while(){}* and *do{}while()* statements is that recursive calls are functional and are more easily optimizable than imperative programming.

The PigJournal (http://wiki.apache.org/pig/PigJournal) has an entry for all of these constructs and functions under the heading "Extending Pig to Include Branching, Looping, and Functions", but because the map-reduce paradigm is inherently functional, I would rather think that staying functional would be a better way to approach this improvement.

So the minimal set of additional features needed is functions and branching, and we would have loops as a side-effect of those improvements.

In order for the optimizations to be available to the PigLatin interpreter, the functions and branching *must* be implemented within the Pig system. If it is externalized, or implemented as UDL of some other language, then opportunities for optimization of the execution vanish.

Anyways, a couple of cents on a rainy day.

On Wed, Mar 10, 2010 at 10:15 AM, hc busy hc.b...@gmail.com wrote:

An additional thought... we can define udf's like ADD(bag{(int,int)}), DIVIDE(bag{(int,int)}), MULTIPLY(bag{(int,int)}), SQRT(bag{(float)}).. basically vectorize most of the common arithmetic operations, but then the language has to support it by converting bag.a + bag.b to ADD(bag.(a,b)).

I guess there are some difficulties, for instance:

    SQRT(bag.a)+bag.b

How would this work? Because SQRT(bag.a) returns a bag, how would we convert it to the correct per-tuple operation? It's almost like we want to convert an expression SUM(SQRT(bag.a),bag.b) into a function F such that

    SUM(SQRT(bag.a),bag.b) = F(bag.a,bag.b)

and then F is computed by iterating through each tuple of the bag:

    FOREACH ... GENERATE ..., F(bag.(a,b));

On Wed, Mar 10, 2010 at 9:31 AM, hc busy hc.b...@gmail.com wrote:

So, pig team, what is the right way to accomplish this?

On Tue, Mar 9, 2010 at 10:50 PM, Mridul Muralidharan mrid...@yahoo-inc.com wrote:

On Tuesday 09 March 2010 04:13 AM, hc busy wrote:

okay. Here's the bag that I have:

    {group: (a: int,b: chararray,c: chararray,d: int), TABLE: {number1: int, number2:int}}

and I want to do this

    grunt CALCULATE= FOREACH
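
To make the proposed COGROUP semantics concrete, here is a small Python model of what the message asks for: an item-wise operator applied across the bags of a cogroup (an implicit cross product), followed by an aggregate. It is a sketch of the proposal, not of anything Pig implements; the helper names `cogroup` and `comap` and the sample data are assumptions.

```python
from itertools import product


def cogroup(left, right, key):
    """Group two relations by key; each group holds one bag per relation."""
    groups = {}
    for rec in left:
        groups.setdefault(rec[key], ([], []))[0].append(rec)
    for rec in right:
        groups.setdefault(rec[key], ([], []))[1].append(rec)
    return groups


def comap(fn, bag_a, bag_b):
    """Apply fn to every pairing of the two bags (the implicit cross product)."""
    return [fn(a, b) for a, b in product(bag_a, bag_b)]


employee = [{"name": "ann", "salary": 100}, {"name": "bob", "salary": 200}]
bonuses = [
    {"name": "ann", "multiplier": 2},
    {"name": "ann", "multiplier": 3},
    {"name": "bob", "multiplier": 1},
]

# Equivalent in spirit to REDUCE(SUM, COMAP(*, salary, multiplier)) per group.
total_compensation = {
    name: sum(comap(lambda e, b: e["salary"] * b["multiplier"], emps, bons))
    for name, (emps, bons) in cogroup(employee, bonuses, "name").items()
}
```

For "ann" the cross product pairs one salary with two multipliers, which is the same set of rows the two explicit FLATTENs would produce before the second GROUP.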
Re: Operating on Cogroups and Iterations in Pig Re: more bagging fun
hc,

Good stuff. I was thinking along very similar lines with regards to allowing mapping a function over a bag.

I suspect a MAP can actually be written as a udf. We'd just have to pass the name of the function to be mapped and call InstantiateFuncFromSpec on it. We may want a different name for it, as map and reduce are associated with the Hadoop map and reduce stages when talking about Pig, and at some point Pig may want to allow users to explicitly set up map and reduce jobs -- as opposed to mapping functions to members of bags.

-D

On Fri, Mar 12, 2010 at 2:00 PM, hc busy hc.b...@gmail.com wrote:

[...]
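
The idea in the reply above, a MAP written as an ordinary UDF that receives the name of the function to apply, can be sketched as follows. This is a Python illustration only: the `FUNCTIONS` registry is a hypothetical stand-in for resolving a function from its name and does not reproduce Pig's own function-instantiation mechanism.

```python
import math

# Hypothetical registry standing in for "look up a function by its name";
# this is not how Pig instantiates functions.
FUNCTIONS = {"SQRT": math.sqrt, "ABS": abs}


def map_bag(func_name, bag):
    """Apply the function registered under func_name to each bag member."""
    func = FUNCTIONS[func_name]
    return [func(item) for item in bag]


mapped = map_bag("SQRT", [4.0, 9.0, 16.0])
```

Passing the name rather than the function itself is what lets this work without UDFs being first class values in the language.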
[jira] Commented: (PIG-1292) Interface Refinements
[ https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844724#action_12844724 ]

Xuefu Zhang commented on PIG-1292:

Looking at the OrderedLoadFunc interface,

    public WritableComparable<?> getSplitComparable(InputSplit split, int splitIdx),

I am not sure why the split index suddenly comes into the picture. Though it was in an earlier discussion between Pig and Zebra, we agreed that this is very implementation specific, which shouldn't dictate API design. Thus, I don't think that the split index should be in the signature even if it helps the Zebra implementation. If an implementation needs the split index, it can always store the index in the split it generates. That's exactly what Zebra plans to do.

Interface Refinements

    Key: PIG-1292
    URL: https://issues.apache.org/jira/browse/PIG-1292
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Ashutosh Chauhan
    Assignee: Ashutosh Chauhan
    Fix For: 0.7.0
    Attachments: pig-1292.patch, pig-interfaces.patch
[jira] Created: (PIG-1295) Binary comparator for secondary sort
Binary comparator for secondary sort

    Key: PIG-1295
    URL: https://issues.apache.org/jira/browse/PIG-1295
    Project: Pig
    Issue Type: Improvement
    Components: impl
    Affects Versions: 0.7.0
    Reporter: Daniel Dai

When the hadoop framework does the sorting, it will try to use the binary version of the comparator if available. The benefit of a binary comparator is that we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to a binary comparator. Currently, Pig uses a binary comparator in the following cases:

1. When the semantics of order don't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how the comparator sorts keys. Group by also shares this characteristic. In this case, we rely on hadoop's default binary comparator.
2. The semantics of order matter, but the key is of a simple type. In this case, we have implementations for simple types, such as integer, long, float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to using secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is group by followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but for the first step, we focus on a simple secondary key, which is the most common use case.

We mark this issue as a candidate project for the Google Summer of Code 2010 program.
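
The approach outlined above (find the main-key and secondary-key boundary in the serialized bytes, compare main keys as raw bytes, then compare secondary keys with real ordering semantics) can be sketched in Python. The byte layout used here is an assumption chosen for the example, a 4-byte length, a UTF-8 main key, and a 4-byte signed integer secondary key; it is not Pig's tuple serialization, and `serialize` and `compare_raw` are hypothetical names.

```python
import struct
from functools import cmp_to_key


def serialize(main_key, secondary):
    """Encode (main_key, secondary) as: 4-byte length, UTF-8 key, 4-byte signed int."""
    raw = main_key.encode("utf-8")
    return struct.pack(">i", len(raw)) + raw + struct.pack(">i", secondary)


def compare_raw(a, b):
    """Compare two serialized tuples without rebuilding the tuples."""
    (len_a,) = struct.unpack_from(">i", a, 0)
    (len_b,) = struct.unpack_from(">i", b, 0)
    main_a = a[4:4 + len_a]
    main_b = b[4:4 + len_b]
    if main_a != main_b:
        # Main-key order only has to be consistent, so raw bytes suffice.
        return -1 if main_a < main_b else 1
    # Secondary-key order matters: read just that field at its offset.
    (sec_a,) = struct.unpack_from(">i", a, 4 + len_a)
    (sec_b,) = struct.unpack_from(">i", b, 4 + len_b)
    return (sec_a > sec_b) - (sec_a < sec_b)


records = [serialize("b", 1), serialize("a", 5), serialize("a", -2)]
ordered = sorted(records, key=cmp_to_key(compare_raw))
```

Only the secondary field is decoded, and only when the main keys tie, which is where the saving over instantiating whole tuples would come from.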
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-1178:

    Status: Patch Available (was: Open)

LogicalPlan and Optimizer are too complex and hard to work with

    Key: PIG-1178
    URL: https://issues.apache.org/jira/browse/PIG-1178
    Project: Pig
    Issue Type: Improvement
    Reporter: Alan Gates
    Assignee: Ying He
    Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.patch

The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer.

See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-1178:

    Attachment: pig_1178_3.patch

LogicalPlan and Optimizer are too complex and hard to work with

    Key: PIG-1178
    URL: https://issues.apache.org/jira/browse/PIG-1178
    Project: Pig
    Issue Type: Improvement
    Reporter: Alan Gates
    Assignee: Ying He
    Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.patch
[jira] Commented: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty
[ https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844795#action_12844795 ]

Hadoop QA commented on PIG-1290:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12438556/PIG-1290.patch
against trunk revision 922169.

    +1 @author. The patch does not contain any @author tags.
    -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/247/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/247/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/247/console

This message is automatically generated.

WeightedRangePartitioner should not check if input is empty if quantile file is empty

    Key: PIG-1290
    URL: https://issues.apache.org/jira/browse/PIG-1290
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.6.0, 0.7.0
    Reporter: Pradeep Kamath
    Assignee: Pradeep Kamath
    Fix For: 0.7.0
    Attachments: PIG-1290.patch
[jira] Updated: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty
[ https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1290:

    Status: Patch Available (was: Open)

Again there seem to be transient, unrelated test failures. I am resubmitting one more time and will also kick off a unit test run on my machine.

WeightedRangePartitioner should not check if input is empty if quantile file is empty

    Key: PIG-1290
    URL: https://issues.apache.org/jira/browse/PIG-1290
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.6.0, 0.7.0
    Reporter: Pradeep Kamath
    Assignee: Pradeep Kamath
    Fix For: 0.7.0
    Attachments: PIG-1290.patch