[jira] Updated: (PIG-1293) pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set

2010-03-12 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated PIG-1293:
--

Status: Open  (was: Patch Available)

 pig wrapper script tends to fail if pig is in the path and PIG_HOME isn't set
 -

 Key: PIG-1293
 URL: https://issues.apache.org/jira/browse/PIG-1293
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Allen Wittenauer
 Attachments: PIG-1293.txt


 If PIG_HOME isn't set and pig is in the path, the pig wrapper script can't 
 find its home.  Setting PIG_HOME makes it hard to support multiple versions 
 of pig. 
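One way to avoid depending on PIG_HOME is to derive the home directory from the resolved location of the wrapper itself. The following is only a Python sketch of that idea; the real wrapper is a shell script and the attached patch may do something different. The function name resolve_home and the <home>/bin/<script> layout are assumptions made for illustration.

```python
import os

def resolve_home(script_path, env=None):
    """Return the install home for a wrapper script.

    Assumes the hypothetical layout <home>/bin/<script>. An explicitly
    set PIG_HOME still wins, so existing setups keep working.
    """
    env = os.environ if env is None else env
    explicit = env.get("PIG_HOME")
    if explicit:
        return explicit
    # realpath follows symlinks, so a link on the PATH that points into a
    # versioned install resolves to that install's directory.
    real = os.path.realpath(script_path)
    return os.path.dirname(os.path.dirname(real))
```

Because the home is computed per invocation, several versions can sit side by side and each wrapper finds its own tree.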

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1292) Interface Refinements

2010-03-12 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844611#action_12844611
 ] 

Dmitriy V. Ryaboy commented on PIG-1292:


Agreed with Xuefu's comment regarding the interfaces. This really seems like 
something where the abstract func can just default to false.

Method name suggestion: how about hasKeyToSplitAffinity()?


 Interface Refinements
 -

 Key: PIG-1292
 URL: https://issues.apache.org/jira/browse/PIG-1292
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-interfaces.patch


 A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
 are abstract classes instead of being interfaces.




[jira] Updated: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty

2010-03-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1290:


Status: Open  (was: Patch Available)

 WeightedRangePartitioner should not check if input is empty if quantile file 
 is empty
 -

 Key: PIG-1290
 URL: https://issues.apache.org/jira/browse/PIG-1290
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1290.patch


 Currently WeightedRangePartitioner checks if the input is also empty if the 
 quantile file is empty. For this it tries to read the input (which under the 
 covers will result in creating splits for the input etc). If the input is a 
 directory with many files, this could result in many calls to the namenode 
 from each task - this can be avoided.
 If the input is non empty and quantile file is empty, then we would error out 
 anyway (this should be confirmed). Also while fixing this jira we should 
 ensure that pig can still do order by on empty input.
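The intended behaviour can be sketched roughly as follows. This is a simplified Python stand-in, not the code in PIG-1290.patch: it ignores the weighting entirely, the name get_partition is hypothetical, and sending every key to partition 0 when there are no quantiles is an assumption. The point is only that an empty quantile list is handled without ever looking at the input.

```python
import bisect

def get_partition(key, quantiles, num_partitions):
    """Range-partition key using sorted quantile boundaries.

    quantiles stands in for the contents of the quantile file. When it is
    empty the function returns immediately and never inspects the input,
    so no splits are created and no namenode calls are made.
    """
    if not quantiles:
        # Assumption: with no boundaries, everything goes to partition 0.
        return 0
    # bisect_left gives the index of the first boundary >= key.
    return min(bisect.bisect_left(quantiles, key), num_partitions - 1)
```

With empty input there are simply no keys to route, so an order by on empty input still completes under this scheme.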




[jira] Updated: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty

2010-03-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1290:


Status: Patch Available  (was: Open)

Looks like the unit test failure was due to some other check-in, which has now 
been fixed. Resubmitting.

 WeightedRangePartitioner should not check if input is empty if quantile file 
 is empty
 -

 Key: PIG-1290
 URL: https://issues.apache.org/jira/browse/PIG-1290
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1290.patch


 Currently WeightedRangePartitioner checks if the input is also empty if the 
 quantile file is empty. For this it tries to read the input (which under the 
 covers will result in creating splits for the input etc). If the input is a 
 directory with many files, this could result in many calls to the namenode 
 from each task - this can be avoided.
 If the input is non empty and quantile file is empty, then we would error out 
 anyway (this should be confirmed). Also while fixing this jira we should 
 ensure that pig can still do order by on empty input.




[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?

2010-03-12 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-506:
---

Labels: mentor gsoc  (was: )

 Does pig need a NATIVE keyword?
 ---

 Key: PIG-506
 URL: https://issues.apache.org/jira/browse/PIG-506
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor

 Assume a user had a job that broke easily into three pieces.  Further assume 
 that pieces one and three were easily expressible in pig, but that piece two 
 needed to be written in map reduce for whatever reason (performance, 
 something that pig could not easily express, legacy job that was too 
 important to change, etc.).  Today the user would either have to use map 
 reduce for the entire job or manually handle the stitching together of pig 
 and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
 would allow the script to pass off the data stream to the underlying system 
 (in this case map reduce)?  The semantics of NATIVE would vary by underlying 
 system.  In the map reduce case, we would assume that this indicated a 
 collection of one or more fully contained map reduce jobs, so that pig would 
 store the data, invoke the map reduce jobs, and then read the resulting data 
 to continue.  It might look something like this:
 {code}
 A = load 'myfile';
 X = load 'myotherfile';
 B = group A by $0;
 C = foreach B generate group, myudf(B);
 D = native (jar=mymr.jar, infile=frompig outfile=topig);
 E = join D by $0, X by $0;
 ...
 {code}
 This differs from streaming in that it allows the user to insert an arbitrary 
 amount of native processing, whereas streaming allows the insertion of one 
 binary.  It also differs in that, for streaming, data is piped directly into 
 and out of the binary as part of the pig pipeline.  Here the pipeline would 
 be broken, data written to disk, and the native block invoked, then data read 
 back from disk.
 Another alternative is to say this is unnecessary because the user can do the 
 coordination from java, using the PigServer interface to run pig and calling 
 the map reduce job explicitly.  The advantages of the native keyword are that 
 the user need not be worried about coordination between the jobs; pig will 
 take care of it.  Also the user can make use of existing java applications 
 without being a java programmer.




[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-03-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844633#action_12844633
 ] 

Daniel Dai commented on PIG-366:


Marking this as a candidate project for the Google Summer of Code 2010 program.

Notes for GSOC 2010 applicants:
1. A good starting point for this project is the SIGMOD paper [Generating Example 
Data for Dataflow 
Programs|http://infolab.stanford.edu/~olston/publications/sigmod09.pdf].
2. The current code is outdated and no longer works. We need your help to 
bring this work up to date.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Shubham Chopra
Priority: Minor
 Attachments: org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.




[jira] Commented: (PIG-1292) Interface Refinements

2010-03-12 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844640#action_12844640
 ] 

Dmitriy V. Ryaboy commented on PIG-1292:


.. but we have an abstract class that can provide default implementations so 
that implementers don't have to think about this.

Most of the interfaces introduced in PIG-966 have significant chunks of 
functionality associated with them. This is just a single method about a 
particular property of the incoming data.
I can see why you'd be against putting it into LoadFunc, though, as it's very 
specific. What about ResourceSchema or LoadMetaData?

 Interface Refinements
 -

 Key: PIG-1292
 URL: https://issues.apache.org/jira/browse/PIG-1292
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-interfaces.patch


 A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
 are abstract classes instead of being interfaces.




[jira] Updated: (PIG-1292) Interface Refinements

2010-03-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1292:
--

Attachment: pig-1292.patch

I didn't follow the suggestion about LoadMetaData and ResourceSchema. 
LoadMetaData is one of those interfaces that loaders can choose to implement. 
ResourceSchema is an independent class of its own.

The new patch incorporates the changes suggested in the above comments. This 
patch also adds checks in the MRCompiler to enforce that the loader implements 
the new CollectableLoader interface if there is a map-side grouping (PIG-984) 
in the script.

 Interface Refinements
 -

 Key: PIG-1292
 URL: https://issues.apache.org/jira/browse/PIG-1292
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1292.patch, pig-interfaces.patch


 A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
 are abstract classes instead of being interfaces.




[jira] Updated: (PIG-1292) Interface Refinements

2010-03-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1292:
--

Status: Patch Available  (was: Open)

Hudson has been fickle recently. Hopefully, this patch gets lucky and is tested 
correctly.

 Interface Refinements
 -

 Key: PIG-1292
 URL: https://issues.apache.org/jira/browse/PIG-1292
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1292.patch, pig-interfaces.patch


 A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
 are abstract classes instead of being interfaces.




Operating on Cogroups and Iterations in Pig Re: more bagging fun

2010-03-12 Thread hc busy
Hmm, okay, I read the documentation further and it appears that this has
already been discussed previously (here:
http://wiki.apache.org/pig/PigTypesFunctionalSpec). There seems to be a
question of what the right thing to do is. It seems clear to me, though.
When an operation like '*' is applied, this is clearly an item-wise
operation that is to be applied to each member of the bag. If a function is
an aggregate (SUM), then it operates across an entire bag.

When a COGROUP occurs, just do what SQL does, which is to say, perform a
cross join if an aggregate has been applied across several bags. And do so
automatically, so we don't have to type out the separate FLATTENs:

grouped = COGROUP employee BY name, bonuses BY name;
flattened = FOREACH grouped GENERATE group, FLATTEN(employee),
FLATTEN(bonuses);
grouped_again = GROUP flattened BY group;
total_compensation = FOREACH grouped_again GENERATE group,
SUM(employee::salary * bonuses::multiplier);

So this should do the same:

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
SUM(employee::salary * bonuses::multiplier);


automatically, because that can only have one meaning.

Alternatively, if it is desired to stay with a low-level language, the
confusion around UDFs that take bags and UDFs that operate on members of
bags can be resolved if we do two things.

1.) Allow UDFs to actually become first-class citizens. This way we can
pass UDFs to other UDFs.
2.) Introduce the concept of map() and reduce() operators over bags.

These two things allow us more freedom and follow the paradigm of
map-reduce more closely.

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
reduce(SUM,map(*,employee::salary,bonuses::multiplier));


Actually, this may deserve a separate keyword. Because map and reduce
operate on single bags whereas Pig introduces this concept of co-grouping,
we should have *comap* and *coreduce* that take functions and operate on
multiple bags that are the results of a *cogroup*.

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
REDUCE(SUM,COMAP(*, employee::salary,bonuses::multiplier));


This allows us to write efficiently, on one line, what would otherwise be
several aliases and unnecessary FLATTENed cross products.
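To make the proposed semantics concrete, here is a small Python sketch of how I read COMAP and REDUCE for a single cogrouped record. This is not Pig code; the names comap and bag_reduce are hypothetical, and treating COMAP as a cross product of the bags is an assumption based on the cross join described above.

```python
from functools import reduce
from itertools import product
import operator

def comap(func, *bags):
    """Apply func to every combination of one item per bag (a cross product)."""
    return [func(*combo) for combo in product(*bags)]

def bag_reduce(func, bag, initial):
    """Fold a bag down to a single value with a binary function."""
    return reduce(func, bag, initial)

# One cogrouped record: the salaries and bonus multipliers for one name.
salaries = [100, 200]
multipliers = [2, 3]
products = comap(operator.mul, salaries, multipliers)  # [200, 300, 400, 600]
total = bag_reduce(operator.add, products, 0)          # 1500
```

The cross product is produced and consumed inside one expression, so no intermediate alias is needed for the flattened rows.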

A second thing that I see is the recommendation of implementing looping
constructs. I wonder if I may suggest, as a follow-up to the above, that we
beef up UDFs as first-class citizens and add the ability to create UDFs
in Pig Latin with the ability to recurse.

The reason why I think this is a better way to loop than *for(;;)*,
*while(){}* and *do{}while()* statements is that recursive calls are
functional and are more easily optimizable than imperative programming. The
PigJournal (http://wiki.apache.org/pig/PigJournal) has an entry for all of
these constructs and functions under the heading "Extending Pig to Include
Branching, Looping, and Functions", but because the map-reduce paradigm is
inherently functional, I think that staying functional would be a better
way to approach this improvement. So the minimal set of additional features
needed is functions and branching, and we would have loops as a side effect
of those improvements.

In order for the optimizations to be available to the PigLatin interpreter,
the functions and branching *must* be implemented within the Pig system. If
they are externalized, or implemented as UDL of some other language, then
the opportunities for optimizing the execution vanish.


Anyways, a couple of cents on a rainy day.




On Wed, Mar 10, 2010 at 10:15 AM, hc busy hc.b...@gmail.com wrote:

 An additional thought... we can define udf's like

 ADD(bag{(int,int)}), DIVIDE(bag{(int,int)}), MULTIPLY(bag{(int,int)}),
 SQRT(bag{(float)})..

 basically vectorize most of the common arithmetic operations, but then the
 language has to support it by converting

 bag.a + bag.b

 to

 ADD(bag.(a,b))

 I guess there are some difficulties, for instance:

 SQRT(bag.a)+bag.b

 How would this work? because sqrt(bag.a) returns a bag, how would we
 convert it to the correct per tuple operation? It's almost like we want to
 convert an expression

 SUM(SQRT(bag.a),bag.b)

 into a function F such that

 SUM(SQRT(bag.a),bag.b) = F(bag.a,bag.b)

 and then the F is computed by iterating through on each tuple of the bag.

 FOREACH ... GENERATE ..., F(bag.(a,b));
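The fusion idea can be sketched in Python like this. It is only an illustration of the per-tuple evaluation described above, not a proposal for how Pig would compile it; the helper name fuse is hypothetical.

```python
import math

def fuse(expr):
    """Turn a per-field expression into a function over a whole bag.

    expr takes one tuple's fields. The returned F walks the bag once and
    applies expr to each tuple, so no intermediate bag is built for SQRT.
    """
    def F(bag):
        return [expr(*row) for row in bag]
    return F

# SQRT(bag.a) + bag.b, evaluated tuple by tuple.
F = fuse(lambda a, b: math.sqrt(a) + b)
result = F([(4.0, 1.0), (9.0, 2.0)])  # [3.0, 5.0]
```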






 On Wed, Mar 10, 2010 at 9:31 AM, hc busy hc.b...@gmail.com wrote:


 So, pig team, what is the right way to accomplish this?


 On Tue, Mar 9, 2010 at 10:50 PM, Mridul Muralidharan 
 mrid...@yahoo-inc.com wrote:

 On Tuesday 09 March 2010 04:13 AM, hc busy wrote:

 okay. Here's the bag that I have:

  {group: (a: int,b: chararray,c: chararray,d: int), TABLE: {number1:
 int,
 number2:int}}



 and I want to do this

 grunt  CALCULATE= FOREACH 

Re: Operating on Cogroups and Iterations in Pig Re: more bagging fun

2010-03-12 Thread Dmitriy Ryaboy
hc,
Good stuff. I was thinking along very similar lines with regards to allowing
mapping a function over a bag. I suspect a MAP can actually be written as a
udf. We'd just have to pass the name of the function to be mapped and call
InstantiateFuncFromSpec on it.

We may want a different name for it, as map and reduce are associated
with the Hadoop map and reduce stages when talking about Pig, and at some
point Pig may want to allow users to explicitly set up map and reduce jobs
-- as opposed to mapping functions to members of bags.
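As a rough illustration of passing the function by name, here is a Python sketch. The FUNCTIONS table is a hypothetical stand-in for resolving a function from its spec string; it does not reproduce the Pig API, and map_bag is an invented name.

```python
import math

# Hypothetical registry standing in for looking a function up by name.
FUNCTIONS = {"SQRT": math.sqrt, "ABS": abs}

def map_bag(func_name, bag):
    """Resolve func_name once, then apply it to every member of the bag."""
    func = FUNCTIONS[func_name]
    return [func(item) for item in bag]
```

For example, map_bag("ABS", [-1, 2]) returns [1, 2].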

-D



[jira] Commented: (PIG-1292) Interface Refinements

2010-03-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844724#action_12844724
 ] 

Xuefu Zhang commented on PIG-1292:
--

Looking at the OrderedLoadFunc interface, public WritableComparable<?> 
getSplitComparable(InputSplit split, int splitIdx), I am not sure why the split 
index suddenly comes into the picture. Though it came up in earlier discussion 
between Pig and Zebra, we agreed that this is very implementation specific and 
shouldn't dictate API design. Thus, I don't think that the split index should 
be in the signature even if it helps the Zebra implementation. If an 
implementation needs the split index, it can always store the index in the 
split it generates. That's exactly what Zebra plans to do.


 Interface Refinements
 -

 Key: PIG-1292
 URL: https://issues.apache.org/jira/browse/PIG-1292
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1292.patch, pig-interfaces.patch


 A loader can't implement both OrderedLoadFunc and IndexableLoadFunc, as both 
 are abstract classes instead of being interfaces.




[jira] Created: (PIG-1295) Binary comparator for secondary sort

2010-03-12 Thread Daniel Dai (JIRA)
Binary comparator for secondary sort


 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai


When the hadoop framework does the sorting, it will try to use the binary 
version of the comparator if available. The benefit of a binary comparator is 
that we do not need to instantiate the object before we compare. We see a ~30% 
speedup after we switch to a binary comparator. Currently, Pig uses a binary 
comparator in the following cases:

1. When the semantics of the order don't matter. For example, in distinct, we 
need to do a sort in order to filter out duplicate values; however, we do not 
care how the comparator sorts keys. Group by also shares this characteristic. 
In this case, we rely on hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. In 
this case, we have implementations for simple types, such as integer, long, 
float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a 
binary comparator implementation. This especially matters when we switch to 
using secondary sort. In secondary sort, we convert the inner sort of a nested 
foreach into the secondary key and rely on hadoop to sort on both the main key 
and the secondary key. The sorting key becomes a two-item tuple. Since the 
secondary key is the sorting key of the nested foreach, the sorting semantics 
matter. It turns out we do not have a binary comparator once we use secondary 
sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary 
structure of the serialized tuple. We can focus on the most common use case 
first, which is a group by followed by a nested sort. In this case, we will use 
secondary sort. The semantics of the first key do not matter, but the semantics 
of the secondary key do. We need to identify the boundary between the main key 
and the secondary key in the binary tuple buffer without instantiating the 
tuple itself. Then, if the first keys are equal, we use a binary comparator to 
compare the secondary keys. The secondary key can also be a complex data type, 
but as a first step, we focus on a simple secondary key, which is the most 
common use case.
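A minimal Python sketch of the approach, under a made-up serialization: a 4-byte length prefix, the main key as UTF-8 bytes, then the secondary key as an 8-byte signed long. This layout is an assumption for illustration and is not Pig's actual tuple format; serialize and compare_raw are hypothetical names.

```python
import struct

def serialize(main_key, secondary_key):
    """Encode (chararray, long) as: 4-byte length, UTF-8 bytes, 8-byte signed long."""
    raw = main_key.encode("utf-8")
    return struct.pack(">i", len(raw)) + raw + struct.pack(">q", secondary_key)

def _split(buf):
    """Find the key boundary from the length prefix; return the two byte ranges."""
    (n,) = struct.unpack_from(">i", buf, 0)
    return buf[4:4 + n], buf[4 + n:4 + n + 8]

def compare_raw(a, b):
    """Compare two serialized keys without rebuilding the tuples."""
    main_a, sec_a = _split(a)
    main_b, sec_b = _split(b)
    # Main key: any consistent order will do, so compare the bytes directly.
    if main_a != main_b:
        return -1 if main_a < main_b else 1
    # Secondary key: order matters, so decode only this fixed-width field.
    (x,) = struct.unpack(">q", sec_a)
    (y,) = struct.unpack(">q", sec_b)
    return (x > y) - (x < y)
```

The main key is never decoded, and the secondary key is decoded only when the main keys are byte-for-byte equal, which mirrors the two-step comparison described above.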

We mark this issue as a candidate project for the Google Summer of Code 2010 
program.




[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-03-12 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-1178:


Status: Patch Available  (was: Open)

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Ying He
 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, 
 pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.




[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-03-12 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-1178:


Attachment: pig_1178_3.patch

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Ying He
 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, 
 pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.




[jira] Commented: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty

2010-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844795#action_12844795
 ] 

Hadoop QA commented on PIG-1290:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12438556/PIG-1290.patch
  against trunk revision 922169.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/247/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/247/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/247/console

This message is automatically generated.

 WeightedRangePartitioner should not check if input is empty if quantile file 
 is empty
 -

 Key: PIG-1290
 URL: https://issues.apache.org/jira/browse/PIG-1290
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1290.patch


 Currently WeightedRangePartitioner checks if the input is also empty if the 
 quantile file is empty. For this it tries to read the input (which under the 
 covers will result in creating splits for the input etc). If the input is a 
 directory with many files, this could result in many calls to the namenode 
 from each task - this can be avoided.
 If the input is non empty and quantile file is empty, then we would error out 
 anyway (this should be confirmed). Also while fixing this jira we should 
 ensure that pig can still do order by on empty input.




[jira] Updated: (PIG-1290) WeightedRangePartitioner should not check if input is empty if quantile file is empty

2010-03-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1290:


Status: Patch Available  (was: Open)

Again there seem to be transient unrelated test failures - am resubmitting one 
more time - will also kick off a unit test run on my machine.

 WeightedRangePartitioner should not check if input is empty if quantile file 
 is empty
 -

 Key: PIG-1290
 URL: https://issues.apache.org/jira/browse/PIG-1290
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.7.0

 Attachments: PIG-1290.patch


 Currently WeightedRangePartitioner checks if the input is also empty if the 
 quantile file is empty. For this it tries to read the input (which under the 
 covers will result in creating splits for the input etc). If the input is a 
 directory with many files, this could result in many calls to the namenode 
 from each task - this can be avoided.
 If the input is non empty and quantile file is empty, then we would error out 
 anyway (this should be confirmed). Also while fixing this jira we should 
 ensure that pig can still do order by on empty input.
