[jira] Created: (PIG-1465) Filter inside foreach is broken

2010-06-25 Thread hc busy (JIRA)
Filter inside foreach is broken
---

 Key: PIG-1465
 URL: https://issues.apache.org/jira/browse/PIG-1465
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: hc busy


{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b{
all_total = SUM(a.num);
fed  = filter a by (f1==f2);
some_total = (int)SUM(fed.num);
generate group as ind, all_total, some_total;
}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}
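
As a cross-check of the expected numbers, here is a minimal plain-Java sketch of what the script asks for. It is not Pig code, and the class and method names are hypothetical. It groups the sample rows by the first column, sums `num` over the whole group, and sums `num` again over only the rows where `f1` equals `f2`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the script above; not part of Pig.
final class FilterInsideForeachSketch {

    // Each input line is "ind,f1,num,f2". Returns one "(ind,all_total,some_total)"
    // string per group, in first-seen order.
    static List<String> totals(List<String> lines) {
        Map<String, int[]> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            int num = Integer.parseInt(f[2]);
            int[] s = sums.computeIfAbsent(f[0], k -> new int[2]);
            s[0] += num;                 // all_total = SUM(a.num)
            if (f[1].equals(f[3])) {     // fed = filter a by (f1 == f2)
                s[1] += num;             // some_total = SUM(fed.num)
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, int[]> e : sums.entrySet()) {
            out.add("(" + e.getKey() + "," + e.getValue()[0] + "," + e.getValue()[1] + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList(
            "x,a,1,a", "x,a,2,a", "x,a,3,b", "x,a,4,b",
            "y,a,1,a", "y,a,2,a", "y,a,3,b", "y,a,4,b");
        for (String row : totals(data)) {
            System.out.println(row);
        }
    }
}
```

On the sample data this yields the two rows listed under what_I_expected.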


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1465) Filter inside foreach is broken

2010-06-25 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1465:
-

Description: 
{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b\{
all_total = SUM(a.num);
fed  = filter a by (f1==f2);
some_total = (int)SUM(fed.num);
generate group as ind, all_total, some_total;
\}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}

  was:
{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b
% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b{
all_total = SUM(a.num);
fed  = filter a by (f1==f2);
some_total = (int)SUM(fed.num);
generate group as ind, all_total, some_total;
}
describe f;
dump f;
% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)
% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}






Re: does EvalFunc generate the entire bag always ?

2010-06-01 Thread hc busy
Well, see, that's the thing: the 'sort A by $0' is already O(n log n).

Ah, I see, my own example suffers from the same problem.

I guess I'm wondering how 'limit' works in conjunction with UDFs. A
practical application escapes me right now, but if I do

C = foreach B{
   C1 = MyUdf(B.bag_on_b);
   C2 = limit C1 5;
}

does Pig know to push the limit down in this case?


On Thu, May 27, 2010 at 2:32 PM, Alan Gates ga...@yahoo-inc.com wrote:

 The default case is that UDFs that take bags (such as COUNT, etc.) are
 handed the entire bag at once.  In the case where all UDFs in a foreach
 implement the algebraic interface and the expression itself is algebraic,
 then the combiner will be used, thus significantly limiting the size of the
 bag handed to the UDF.  The accumulator does hand records to the UDF a few
 thousand at a time.  Currently it has no way to turn off the flow of
 records.

 What you want might be accomplished by the LIMIT operator, which can be
 used inside a nested foreach.  Something like:

 C = foreach B {
C1 = sort A by $0;
C2 = limit 5 C1;
generate myUDF(C2);
 }

 Alan.


 On May 26, 2010, at 11:59 AM, hc busy wrote:

  Hey, guys, how are bags passed to EvalFunc stored?

 I was looking at the Accumulator interface, and it says that the reason it
 is needed for COUNT and SUM is that EvalFunc always gives you the entire
 bag when the EvalFunc is run on a bag.

 I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the
 code inside that does


 for(Tuple entry:inputDataBag){
  stuff
 }


 then this was an actual iterator that iterated over the bag sequentially,
 without necessarily having the entire bag in memory all at once. Because
 it's an iterator, there's no way to do anything other than stream through
 it.

 I'm looking at this because Accumulator has no way of telling Pig "I've
 seen enough". It streams through the entire bag no matter what happens.
 Hypothetically speaking, if I were writing a "5th item of a sorted bag"
 UDF, then after I see the 5th entry of a 5-million-entry bag, I want to
 stop executing if possible.

 Is there an easy way to make this happen?





does EvalFunc generate the entire bag always ?

2010-05-26 Thread hc busy
Hey, guys, how are bags passed to EvalFunc stored?

I was looking at the Accumulator interface, and it says that the reason it
is needed for COUNT and SUM is that EvalFunc always gives you the entire
bag when the EvalFunc is run on a bag.

I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the
code inside that does


for(Tuple entry:inputDataBag){
  stuff
}


then this was an actual iterator that iterated over the bag sequentially,
without necessarily having the entire bag in memory all at once. Because
it's an iterator, there's no way to do anything other than stream through it.

I'm looking at this because Accumulator has no way of telling Pig "I've seen
enough". It streams through the entire bag no matter what happens.
Hypothetically speaking, if I were writing a "5th item of a sorted bag" UDF,
then after I see the 5th entry of a 5-million-entry bag, I want to stop
executing if possible.

Is there an easy way to make this happen?
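
To make the question concrete, here is a small plain-Java sketch contrasting the two styles. It uses no Pig classes, and every name in it is hypothetical. A pull-style loop over an iterator can return after the 5th element, while a push-style accumulator keeps receiving batches for as long as the producer sends them. It says nothing about how Pig actually materializes a bag behind its iterator.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; these are not Pig interfaces.
final class EarlyExitSketch {

    // Pull style: the consumer drives the loop and can stop early.
    // Returns the n-th element (1-based), or null if there are fewer than n.
    static <T> T nthByPull(Iterator<T> it, int n) {
        int seen = 0;
        while (it.hasNext()) {
            T value = it.next();
            if (++seen == n) {
                return value; // the remaining elements are never read
            }
        }
        return null;
    }

    // Push style: the producer hands over batches, and the consumer has no
    // way to signal that it has seen enough.
    static final class NthAccumulator<T> {
        private final int n;
        private int seen = 0;
        private T result = null;
        int batchesReceived = 0;

        NthAccumulator(int n) {
            this.n = n;
        }

        void accumulate(List<T> batch) {
            batchesReceived++;
            for (T value : batch) {
                if (++seen == n) {
                    result = value;
                }
            }
        }

        T getValue() {
            return result;
        }
    }

    public static void main(String[] args) {
        List<Integer> sorted = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(nthByPull(sorted.iterator(), 5));

        NthAccumulator<Integer> acc = new NthAccumulator<>(5);
        acc.accumulate(sorted.subList(0, 4));
        acc.accumulate(sorted.subList(4, 8));
        acc.accumulate(sorted.subList(8, 10));
        System.out.println(acc.getValue() + " after " + acc.batchesReceived + " batches");
    }
}
```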


[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-05-20 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869720#action_12869720
 ] 

hc busy commented on PIG-1150:
--

similarly, there's some code here on numerically stable and distributed 
calculation: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance


I mean, while we're at it, why not calculate all central moments?

{code}
centralMoments(x, y)
{code}
returns central moments of x up to y
{code}
centralMoments(x,3)
{code}
will return a tuple containing

(mean, variance, skew)
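
A rough plain-Java sketch of what such a function could compute, assuming a simple two-pass approach over an in-memory array. The class name is hypothetical and this is not an existing Pig UDF. The first slot holds the mean, and slot k-1 holds the k-th central moment (population form), so the third entry is the raw third central moment rather than a normalized skewness.

```java
// Hypothetical sketch; not an existing Pig UDF.
final class CentralMomentsSketch {

    // Returns {mean, m2, ..., m_order}, where m_k = (1/n) * sum((x_i - mean)^k).
    // Two passes: the mean first, then the powers of the deviations.
    static double[] centralMoments(double[] xs, int order) {
        if (xs.length == 0 || order < 1) {
            throw new IllegalArgumentException("need at least one value and order >= 1");
        }
        double mean = 0.0;
        for (double x : xs) {
            mean += x;
        }
        mean /= xs.length;

        double[] out = new double[order];
        out[0] = mean;
        for (double x : xs) {
            double d = x - mean;
            double p = d;
            for (int k = 2; k <= order; k++) {
                p *= d;             // p now holds d^k
                out[k - 1] += p;
            }
        }
        for (int k = 2; k <= order; k++) {
            out[k - 1] /= xs.length;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] m = centralMoments(new double[] {1.0, 2.0, 3.0, 4.0}, 3);
        System.out.println(m[0] + " " + m[1] + " " + m[2]);
    }
}
```

A distributed version would need mergeable partial results instead of a second pass over the data; the Wikipedia page linked above discusses such formulas.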



 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.
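
The count, sum and sum-of-squares idea can be sketched in plain Java as follows. This is an illustration under assumptions, not the attached patch: the class name is hypothetical, it computes the population variance, and the naive formula it uses can lose precision when the mean is large relative to the spread.

```java
// Hypothetical sketch of the count / sum / sum-of-squares approach.
final class VariancePartialSketch {
    final long count;
    final double sum;
    final double sumSq;

    VariancePartialSketch(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // Builds a partial result from one chunk of values (the "map side").
    static VariancePartialSketch of(double... xs) {
        double s = 0.0;
        double sq = 0.0;
        for (double x : xs) {
            s += x;
            sq += x * x;
        }
        return new VariancePartialSketch(xs.length, s, sq);
    }

    // Partial results combine by plain addition, in any order.
    VariancePartialSketch merge(VariancePartialSketch other) {
        return new VariancePartialSketch(
            count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    // Population variance: E[x^2] - (E[x])^2.
    double variance() {
        if (count == 0) {
            throw new IllegalStateException("no values");
        }
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }

    public static void main(String[] args) {
        VariancePartialSketch left = of(1.0, 2.0);
        VariancePartialSketch right = of(3.0, 4.0);
        System.out.println(left.merge(right).variance());
    }
}
```

Because merging is just addition, the partial results can be combined in any grouping, which is what makes the computation a fit for an algebraic function.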




need help again, what causes Cannot cast to Unknown ?

2010-05-04 Thread hc busy
Hey, guys, I managed to generate another horrendous error message (before
the plan completes). What typically causes this error?

The script survives all the describes (I can describe after every alias
assignment), but it still produces this error.

(running Pig 0.5 on Hadoop 0.20)

2010-05-03 22:54:22,054 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1051: Cannot cast to Unknown
2010-05-03 22:54:22,054 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
    at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
    at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
    at org.apache.pig.PigServer.compileLp(PigServer.java:818)
    at org.apache.pig.PigServer.compileLp(PigServer.java:789)
    at org.apache.pig.PigServer.execute(PigServer.java:758)
    at org.apache.pig.PigServer.access$100(PigServer.java:89)
    at org.apache.pig.PigServer$Graph.execute(PigServer.java:947)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:320)
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve Join output schema
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2360)
    at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:201)
    at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
    at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
    at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
    at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
    ... 14 more
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForJoinInnerPlan(TypeCheckingVisitor.java:2544)
    at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2348)
    ... 19 more


Re: how to compare?

2010-04-28 Thread hc busy
I'm not sure. If the types of the two things that I am comparing (typically
the same field of tuples inside a bag) differ, I expect it to throw an error
instead of ordering the results by data type.

Because if it doesn't, it will either error out later on in the Pig script,
or the data will be serialized out and some other program will read in the
offending field and crash. I'd prefer that it fail early rather than late.

Which is why I'm just casting to Comparable and calling compareTo. The
problem with that is that it depends on each Comparable's compareTo
method to handle errors in similar ways, and I see that it does so by
calling into DataType.compare (around line 166 in DataByteArray for
BYTEARRAYs...). Ah, I see, so casting to Comparable does the same thing as
DataType.compare when the types are different.

Hmm, I guess I want to stick with casting to Comparable, since the two ways
of calling them are identical. Unless people have other comments.
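
For comparison, the fail-early behaviour described above could look like the following plain-Java sketch. It is a hypothetical helper built only on java.lang; it is not Pig's DataType.compare and makes no claim about how Pig orders mixed types. It refuses to order two values whose runtime classes differ, and otherwise defers to compareTo.

```java
// Hypothetical helper; uses only java.lang and is not part of Pig.
final class StrictCompareSketch {

    // Compares two non-null values of the same class via compareTo.
    // Throws instead of imposing an order on values of different classes.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareStrict(Object o1, Object o2) {
        if (o1 == null || o2 == null) {
            throw new IllegalArgumentException("null values are not supported here");
        }
        if (!o1.getClass().equals(o2.getClass())) {
            throw new ClassCastException(
                "cannot compare " + o1.getClass().getName()
                    + " with " + o2.getClass().getName());
        }
        if (!(o1 instanceof Comparable)) {
            throw new ClassCastException(o1.getClass().getName() + " is not Comparable");
        }
        return ((Comparable) o1).compareTo(o2);
    }

    public static void main(String[] args) {
        System.out.println(compareStrict(1, 2) < 0);
        System.out.println(compareStrict("b", "a") > 0);
        try {
            compareStrict(1, "a");
        } catch (ClassCastException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```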



On Wed, Apr 28, 2010 at 3:57 AM, Gianmarco gianmarco@gmail.com wrote:

 Basically, DataType.compare() just calls the compareTo() method of the two
 objects after checking that the two types are the same.
 However, DataType.compare() does two more things than a simple compareTo().

 First, it is specialized for Maps, for which sizes are taken into account
 and keys are sorted.

 Second, it imposes an (arbitrary) order on different data types. In this
 way the types are not dependent on each other and there is a single point
 of control.

 So I think you should use DataType.compare() unless you are sure you do not
 need these features.

 Anyway, there is something that I do not understand.

 What I do not understand is why the function needs to switch on the
 datatype byte and cast the objects before calling compareTo on them. Just
 casting them to Comparable and letting Java run the proper polymorphic
 method should work as well, right?




 On Wed, Apr 28, 2010 at 07:12, hc busy hc.b...@gmail.com wrote:

  Guys, I'm implementing that ExtremalTupleByNthField and I have a question
  about comparison...
 
 
  So, when I have parsed out the two objects that I want to compare, how do
  I perform that comparison? My current implementation assumes the data is
  Comparable (which it invariably is within Pig), so I do
 
 
  int c = ((Comparable)o1).compareTo((Comparable)o2);
 
 
  Now I also see that there's another compare that compares the two objects
  by:
 
 
  int c = DataType.compare(o1, o2, DataType.findType(o1),
  DataType.findType(o2));
 
 
 
  The first method works for all types I've tried (int, string, etc.), but
  the latter is used by another UDF already in SVN.
 
  What are your suggestions?
 
  (PIG-1386 is the ticket where I've checked in the patch.)
 



[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Status: Open  (was: Patch Available)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  refactor the UDF MaxTupleByFirstField to take constructor arguments?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median, etc.
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there a reason not to do the second or third, other than their being
   more complicated?
  
   Certainly I'd volunteer to put the top implementation into the util
   package and submit them as builtins, but the latter syntactic sugar
   seems more natural.
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
   allowed users to define grouping functions (0.1).  Functions like these
   should go in evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
   These are things everyone asks for and they seem like a reasonable
   addition to the core engine.  This will be more of a burden to write (as
   we'll hold them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
   Sometimes I wonder... I mean, somebody went to the trouble of making a
   path called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belongs), but didn't check any Java code
   into that package.
  
  
   Any comments about where to put this kind of utility class?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
   That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDFs in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --sometimes you need it this way for some reason.
  
  
   OK. I'll place my current code here; maybe later I'll make a patch (if
   such an implementation is acceptable, of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
    * Convert any sequence of fields to bag with specified count of fields
    * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
    * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
    *
    * @author astepachev
    */
   public class ToBag extends EvalFunc<DataBag> {
       public BagFactory bagFactory;
       public TupleFactory tupleFactory;
  
       public ToBag() {
           bagFactory = BagFactory.getInstance();
           tupleFactory = TupleFactory.getInstance();
       }
  
       @Override
       public DataBag exec(Tuple input) throws IOException {
           if (input.isNull())
               return null;
           final DataBag bag = bagFactory.newDefaultBag();
           final Integer couter = (Integer) input.get(0);
           if (couter == null)
               return null;
           Tuple tuple = tupleFactory.newTuple();
           for (int i = 0; i < input.size() - 1; i++) {
               if (i % couter == 0) {
                   tuple = tupleFactory.newTuple();
                   bag.add(tuple);
               }
               tuple.append(input.get(i + 1));
           }
           return bag;
       }
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

   Status: Patch Available  (was: Open)
Fix Version/s: 0.8.0


[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Status: Open  (was: Patch Available)


[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)


[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

e503949c4f5f2667657ee02872aff5ce

Additional documentation and examples.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Fix For: 0.8.0

 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we refactor
  the UDF MaxTupleBy1stField to take constructor arguments?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there a reason not to do the second or third, other than their being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation into the util
   package and submit it for builtins, but the latter syntactic sugar
   seems more natural.
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig allowed
   users to define grouping functions (0.1).  Functions like these should go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
   These are things everyone asks for and they seem like a reasonable addition
   to the core engine.  This will be more of a burden to write (as we'll hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
   Sometimes I wonder... I mean, somebody went to the trouble of making a path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belongs), but didn't check any Java code
   into that package.
  
  
   Any comment about where to put this kind of utility class?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
   That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDFs in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --sometimes you need it this way for some reason.
  
  
   OK, I'll post my current code here; maybe later I'll make a patch (if
   such an implementation is acceptable, of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
    * Convert any sequence of fields to a bag of tuples with the specified
    * number of fields per tuple.<br>
    * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
    * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
    *
    * @author astepachev
    */
   public class ToBag extends EvalFunc<DataBag> {
       public BagFactory bagFactory;
       public TupleFactory tupleFactory;
  
       public ToBag() {
           bagFactory = BagFactory.getInstance();
           tupleFactory = TupleFactory.getInstance();
       }
  
       @Override
       public DataBag exec(Tuple input) throws IOException {
           if (input.isNull())
               return null;
           final DataBag bag = bagFactory.newDefaultBag();
           // Field 0 is the tuple width; the remaining fields are the data.
           final Integer counter = (Integer) input.get(0);
           if (counter == null)
               return null;
           Tuple tuple = tupleFactory.newTuple();
           for (int i = 0; i < input.size() - 1; i++) {
               // Start a new tuple every `counter` fields.
               if (i % counter == 0) {
                   tuple = tupleFactory.newTuple();
                   bag.add(tuple);
               }
               tuple.append(input.get(i + 1));
           }
           return bag;
       }
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test
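
The chunking logic of the ToBag UDF quoted above can be tried without a Pig installation. The following is only a sketch under stated assumptions: the class name ToBagSketch and the use of plain java.util lists in place of Pig's Tuple and DataBag are hypothetical, and the guard against a non-positive count is an addition that the quoted code does not have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the quoted ToBag UDF: lists replace Pig's Tuple and DataBag.
class ToBagSketch {
    /**
     * Splits the fields after position 0 into rows of fields.get(0) elements each.
     * Returns null when the input or the count is null, mirroring the quoted UDF.
     */
    static List<List<Object>> toBag(List<Object> fields) {
        if (fields == null || fields.isEmpty() || fields.get(0) == null) {
            return null;
        }
        int count = (Integer) fields.get(0);
        if (count <= 0) {
            // Added guard (assumption): a count of 0 would otherwise fail on `i % 0`.
            throw new IllegalArgumentException("count must be positive");
        }
        List<List<Object>> bag = new ArrayList<>();
        List<Object> row = null;
        for (int i = 0; i < fields.size() - 1; i++) {
            // Start a new row every `count` fields.
            if (i % count == 0) {
                row = new ArrayList<>();
                bag.add(row);
            }
            row.add(fields.get(i + 1));
        }
        return bag;
    }

    public static void main(String[] args) {
        // count=2 followed by five data fields; the last row is left short.
        System.out.println(toBag(Arrays.asList(2, "a", "b", "c", "d", "e")));
    }
}
```

Note that when the number of data fields is not a multiple of the count, the last row simply holds fewer fields; whether that is the desired behaviour for the real UDF is left open here.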

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Status: Patch Available  (was: Open)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Fix For: 0.8.0

 Attachments: PIG-1386-trunk.patch



[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

da673ab2d584faf903e8b49b63a03ade
 
spell check the documentation

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Fix For: 0.8.0

 Attachments: PIG-1386-trunk.patch



[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-27 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Fix For: 0.8.0

 Attachments: PIG-1386-trunk.patch



how to compare?

2010-04-27 Thread hc busy
Guys, I'm implementing that ExtremalTupleByNthField and I have a question
about comparison...


So, once I have parsed out the two objects that I want to compare, how do I
perform that comparison? My current implementation assumes the data is
Comparable (which it invariably is within Pig), so I do


int c = ((Comparable)o1).compareTo((Comparable)o2);


Now I also see that there's another way that compares the two objects:


int c = DataType.compare(o1, o2, DataType.findType(o1),
DataType.findType(o2));



The first method works for all types I've tried (int, string, etc.), but
the latter is used by another UDF already in SVN.

What are your suggestions?

(PIG-1386 is the ticket where I've attached the patch.)

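
One way to see the difference between the two approaches in the message above: a raw Comparable cast works only while both objects have the same runtime type and neither is null. The sketch below is not the PIG-1386 patch and does not call Pig's DataType.compare; the class ExtremalByNthFieldSketch, its method names, and the rule "nulls sort first, mixed numeric types compare by double value" are all assumptions made for illustration, with plain lists standing in for Pig tuples.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plain lists stand in for Pig tuples.
class ExtremalByNthFieldSketch {
    /**
     * Null-safe comparison. Nulls sort first; numbers of different classes are
     * compared by double value. Other mixed types still throw ClassCastException.
     */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareFields(Object o1, Object o2) {
        if (o1 == null || o2 == null) {
            return (o1 == null ? 0 : 1) - (o2 == null ? 0 : 1);
        }
        if (o1 instanceof Number && o2 instanceof Number && o1.getClass() != o2.getClass()) {
            return Double.compare(((Number) o1).doubleValue(), ((Number) o2).doubleValue());
        }
        // Same-type case: the raw Comparable call from the message is safe here.
        return ((Comparable) o1).compareTo(o2);
    }

    /** Returns the row whose n-th field (0-based) is largest, or smallest when max is false. */
    static List<Object> extremal(List<List<Object>> rows, int n, boolean max) {
        List<Object> best = null;
        for (List<Object> row : rows) {
            if (best == null) {
                best = row;
                continue;
            }
            int c = compareFields(row.get(n), best.get(n));
            if (max ? c > 0 : c < 0) {
                best = row;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("x", 3),
                Arrays.<Object>asList("y", 1L),   // Long mixed with Integer on purpose
                Arrays.<Object>asList("z", 2));
        System.out.println(extremal(rows, 1, false));
        System.out.println(extremal(rows, 1, true));
    }
}
```

With a bare `((Comparable) o1).compareTo(o2)`, the Integer-versus-Long pair in the sample rows would throw a ClassCastException, which is the kind of case a type-aware comparison is meant to cover.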

[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor

2010-04-26 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861248#action_12861248
 ] 

hc busy commented on PIG-1303:
--

+(google^2)

that worked!

 unable to set outgoing format for 
 org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
 

 Key: PIG-1303
 URL: https://issues.apache.org/jira/browse/PIG-1303
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: pig 0.6.0 on a fedora linux machine, jdk 1.6 u11
Reporter: Johannes Rußek
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1303.patch, TypeCheckingVisitor.java.diff


 I'm unable to set the format of the outgoing date string in the constructor,
 as it's supposed to work.
 The only way I could change the format was to change the default in the Java
 class and rebuild piggybank.
 Apparently this has something to do with the way Pig instantiates
 DateExtractor; quoting a reply on the mailing list:
 David Vrensk said:
 I ran into the same problem a couple of weeks ago, and
 played around with the code inserting some print/log statements.  It turns
 out that the arguments are only used in the initial constructor calls, when
 the pig process is starting, but once pig reaches the point where it would
 use the udf, it creates new DateExtractors without passing the arguments.
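
The symptom described above can be illustrated with a small stand-alone sketch. It does not reproduce Pig's actual instantiation code or piggybank's DateExtractor: the class FormatUdfSketch, its two factory methods, and the default format string are hypothetical, chosen only to show how a constructor argument is lost when an object is re-created through the no-argument constructor.

```java
// Hypothetical stand-in for a UDF with a configurable output format.
class FormatUdfSketch {
    // Assumed default, for illustration only.
    static final String DEFAULT_FORMAT = "yyyy-MM-dd";
    final String format;

    public FormatUdfSketch() {
        this(DEFAULT_FORMAT);
    }

    public FormatUdfSketch(String format) {
        this.format = format;
    }

    /** First instantiation: the argument supplied by the script reaches the constructor. */
    static FormatUdfSketch instantiateWithArgs(String arg) {
        try {
            return FormatUdfSketch.class.getDeclaredConstructor(String.class).newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Later instantiation that drops the argument: the default format silently comes back. */
    static FormatUdfSketch instantiateWithoutArgs() {
        try {
            return FormatUdfSketch.class.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiateWithArgs("dd/MM/yyyy").format);
        System.out.println(instantiateWithoutArgs().format);
    }
}
```

If the second kind of instantiation is what runs when the UDF is actually used, any format set through the constructor has no effect, which matches the behaviour reported in this issue.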

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-26 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy


[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-26 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

1873fb8d75f7362df343615f623a7390

Added documentation and a bunch of unit tests for the functionality 
that the documentation claims. Cleaned up to revert to not requiring 
a change to EvalFunc's constructor. Added ASF license text.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch



[jira] Commented: (PIG-1385) UDF to create tuples and bags

2010-04-24 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860488#action_12860488
 ] 

hc busy commented on PIG-1385:
--

yeah! my first contrib. Thanks, Alan!!

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Fix For: 0.8.0

 Attachments: PIG-1385-trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Based on this conversation:
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue;
  
   /**
   * @author astepachev
   */
   public class ToBagTest {
PigServer pigServer;
URL inputTxt;
  
@Before
public void init() throws IOException, URISyntaxException {
pigServer = new PigServer(ExecType.LOCAL);
inputTxt =
   this.getClass().getResource("bagTest.txt").toURI().toURL();
}
  
@Test
public void testSimple() throws IOException {
pigServer.registerQuery("a = load '" + inputTxt.toExternalForm

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-24 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

163812d67299dd4b44470c854c80f2a8

redo without the addition of the helper function.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  *define customMaxTuple ExtremalTupleByNthField(n, 'min');*
  *G = group T by id;*
  *M = foreach T generate customMaxTuple(T);
  *
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-24 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  *define customMaxTuple ExtremalTupleByNthField(n, 'min');*
  *G = group T by id;*
  *M = foreach T generate customMaxTuple(T);
  *
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL

[jira] Commented: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-24 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860490#action_12860490
 ] 

hc busy commented on PIG-1386:
--

Okay, here's an alternative. What if we did this instead:


{code}
class EvalFunc{
    ...
    protected String parameters="";
    public EvalFunc(Object... constructorParameters){

        // Build "('p1','p2',...)" from the constructor arguments.
        StringBuilder sb = new StringBuilder();
        if(constructorParameters!=null && constructorParameters.length>0){
            for(Object o:constructorParameters){
                sb.append(',');
                sb.append('\'');
                sb.append(o.toString());
                sb.append('\'');
            }
            // substring(1) drops the leading comma.
            parameters="("+sb.substring(1)+")";
        }
        ... //rest of evalfunc constructor.
        ...
    }
{code}

and my getInitial is implemented thusly:

{code}
@Override
public String getInitial() {
return HelperClass.class.getName() + parameters;
}
{code}
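
For illustration, the string-building step proposed above can be tried outside Pig. This is only a sketch: FuncSpecStrings and quoteArgs are hypothetical names, not part of EvalFunc or of the patch, and the code does not escape quote characters inside an argument.

```java
// Hypothetical helper, not a Pig class: renders constructor arguments the
// way the proposed EvalFunc(Object...) constructor would.
final class FuncSpecStrings {
    private FuncSpecStrings() {
    }

    // Returns "('a','b')" for ("a", "b"), or "" when there are no arguments.
    static String quoteArgs(Object... args) {
        if (args == null || args.length == 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < args.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('\'').append(args[i]).append('\'');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] unused) {
        // "HelperClass" stands in for HelperClass.class.getName().
        System.out.println("HelperClass" + quoteArgs("1", "max")); // HelperClass('1','max')
        System.out.println("HelperClass" + quoteArgs());           // HelperClass
    }
}
```

With no arguments the result is the bare class name, so UDFs without constructor parameters would keep returning the same string from getInitial() as before.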



 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  *define customMaxTuple ExtremalTupleByNthField(n, 'min');*
  *G = group T by id;*
  *M = foreach T generate customMaxTuple(T);
  *
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override

[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor

2010-04-23 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860297#action_12860297
 ] 

hc busy commented on PIG-1303:
--

But the problem is that the EvalFunc constructor, in the case of Algebraic 
classes, constructs each of the Initial, Intermediate and Final classes, which 
are EvalFuncs that, in my case, require a parameter to operate correctly.

If I declare the helper class that represents the initial/intermediate/final 
function like this:


{code}
public class HelperClass extends EvalFunc<Tuple> {
public HelperClass() {
super();
}

public Tuple exec(Tuple input) throws IOException {
return extreme(fieldIndex, sign, input, reporter);
}

}
{code}

where fieldIndex and sign come from the surrounding class (note the class 
is not static), then the code crashes. Pig is not able to construct the 
HelperClass and fails with this error:

{quote}
could not instantiate 
'org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField$HelperClass' with 
arguments 'null'
java.lang.RuntimeException: could not instantiate 
'org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField$HelperClass' with 
arguments 'null'
at 
org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:498)
at org.apache.pig.EvalFunc.getReturnTypeFromSpec(EvalFunc.java:136)
at org.apache.pig.EvalFunc.<init>(EvalFunc.java:123)
at 
org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField.<init>(ExtremalTupleByNthField.java:77)
at 
org.apache.pig.piggybank.evaluation.TestExtremalTupleByNthField.testMin(Unknown 
Source)
Caused by: java.lang.InstantiationException: 
org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField$HelperClass
at java.lang.Class.newInstance0(Class.java:340)
at java.lang.Class.newInstance(Class.java:308)
at 
org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:468)
{quote}

Basically, I think it's not able to construct because the class can only be 
constructed from an instance of ExtremalTupleByNthField.
{code}
ExtremalTupleByNthField etbnf = new ExtremalTupleByNthField("1", "max");
etbnf.new HelperClass();
{code}

So my solution to this problem was to make this class static, and to make it so 
that EvalFunc can take a vararg that will eventually contain the actual 
parameters.

The handleChildConstructorParameters method in EvalFunc will construct a 
string that represents the call into the initial/intermediate/final methods, 
including the parameters that came from the ExtremalTupleByNthField.
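
The instantiation failure described above can be reproduced without Pig. The following is a minimal sketch, assuming that Pig looks for a constructor it can call without an enclosing instance; EnclosingUdf, InnerHelper and StaticHelper are hypothetical stand-ins, not the classes from the patch.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for ExtremalTupleByNthField; not a Pig class.
class EnclosingUdf {
    final int fieldIndex;

    EnclosingUdf(int fieldIndex) {
        this.fieldIndex = fieldIndex;
    }

    // Inner (non-static) class: each instance belongs to an EnclosingUdf, so
    // the compiler adds the outer instance as a hidden constructor parameter.
    class InnerHelper {
        int index() {
            return fieldIndex;
        }
    }

    // Static nested class: no outer instance, so it has a real no-arg constructor.
    static class StaticHelper {
    }
}

final class NestedInstantiationDemo {
    // True when the class declares a zero-argument constructor.
    static boolean hasNoArgConstructor(Class<?> type) {
        try {
            type.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasNoArgConstructor(EnclosingUdf.InnerHelper.class));  // false
        System.out.println(hasNoArgConstructor(EnclosingUdf.StaticHelper.class)); // true

        // The inner class can only be built when an outer instance is supplied.
        Constructor<EnclosingUdf.InnerHelper> ctor =
                EnclosingUdf.InnerHelper.class.getDeclaredConstructor(EnclosingUdf.class);
        EnclosingUdf.InnerHelper helper = ctor.newInstance(new EnclosingUdf(1));
        System.out.println(helper.index()); // 1
    }
}
```

A static nested class avoids the problem, but it also loses access to the outer fields, which is why the parameters then have to travel some other way, such as in the spec string.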

 unable to set outgoing format for 
 org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
 

 Key: PIG-1303
 URL: https://issues.apache.org/jira/browse/PIG-1303
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: pig 0.6.0 on a fedora linux machine, jdk 1.6 u11
Reporter: Johannes Rußek
Assignee: Dmitriy V. Ryaboy
 Attachments: TypeCheckingVisitor.java.diff


 I'm unable to set the format of the outgoing date string in the constructor 
 as it's supposed to work. 
 The only way i could change the format was to change the default in the java 
 class and rebuild piggybank.
 Apparently this has something to do with the way pig instantiates 
 DateExtractor, quoting a replier on the mailing list:
 David Vrensk said:
 I ran into the same problem a couple of weeks ago, and
 played around with the code inserting some print/log statements.  It turns
 out that the arguments are only used in the initial constructor calls, when
 the pig process is starting, but once pig reaches the point where it would
 use the udf, it creates new DateExtractors without passing the arguments.




[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor

2010-04-23 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860310#action_12860310
 ] 

hc busy commented on PIG-1303:
--

Hmm, okay, so let me shorten my problem. Basically, the functions 
getInitial, getIntermed, and getFinal in my Algebraic class don't have access 
to the constructor parameters. The reason is this: in Java, the super() 
constructor can only be called as the very first thing that the deriving 
class's constructor does, so my UDFs have constructors that look like this:


{code}
public ExtremalTupleByNthField(String fieldIndexString, String order) {
    super();
    parameters = "('"+fieldIndexString+"','"+order+"')";
}
@Override
public String getInitial() {
    return HelperClass.class.getName()+parameters;
}
{code}

But the problem is that the EvalFunc() constructor calls the child class's 
getInitial() to type check. When it does this, it finds that my getInitial() 
returns something incomplete because the parameters member variable hasn't 
been initialized yet. This is a pretty mundane problem with Java programs, and 
the way to fix it is what I've submitted in the patch: calling an overridden 
method in the super()'s constructor.

I mean, I don't see any other way to do this, but I'd be willing to work on 
another implementation if you can suggest one?
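
For anyone who wants to see the initialization-order problem in isolation, here is a small sketch. BaseFunc and ParamFunc are hypothetical stand-ins for EvalFunc and ExtremalTupleByNthField; the only assumption is that the base constructor calls getInitial(), as described above.

```java
// Hypothetical stand-in for EvalFunc: its constructor calls an overridable method.
abstract class BaseFunc {
    final String specSeenDuringConstruction;

    BaseFunc() {
        // Runs before any subclass field initializer or constructor body.
        specSeenDuringConstruction = getInitial();
    }

    abstract String getInitial();
}

// Hypothetical stand-in for ExtremalTupleByNthField.
class ParamFunc extends BaseFunc {
    private String parameters;

    ParamFunc(String fieldIndexString, String order) {
        super(); // getInitial() is called in here, while parameters is still null
        parameters = "('" + fieldIndexString + "','" + order + "')";
    }

    @Override
    String getInitial() {
        return "HelperClass" + parameters;
    }
}

final class ConstructorOrderDemo {
    public static void main(String[] args) {
        ParamFunc f = new ParamFunc("1", "max");
        System.out.println(f.specSeenDuringConstruction); // HelperClassnull
        System.out.println(f.getInitial());               // HelperClass('1','max')
    }
}
```

The base constructor sees "HelperClassnull" because the assignment to parameters only runs after super() returns; giving the field an initializer would not help, since field initializers also run after super().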



 unable to set outgoing format for 
 org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
 

 Key: PIG-1303
 URL: https://issues.apache.org/jira/browse/PIG-1303
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: pig 0.6.0 on a fedora linux machine, jdk 1.6 u11
Reporter: Johannes Rußek
Assignee: Dmitriy V. Ryaboy
 Attachments: TypeCheckingVisitor.java.diff


 I'm unable to set the format of the outgoing date string in the constructor 
 as it's supposed to work. 
 The only way i could change the format was to change the default in the java 
 class and rebuild piggybank.
 Apparently this has something to do with the way pig instantiates 
 DateExtractor, quoting a replier on the mailing list:
 David Vrensk said:
 I ran into the same problem a couple of weeks ago, and
 played around with the code inserting some print/log statements.  It turns
 out that the arguments are only used in the initial constructor calls, when
 the pig process is starting, but once pig reaches the point where it would
 use the udf, it creates new DateExtractors without passing the arguments.




[jira] Commented: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-23 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860325#action_12860325
 ] 

hc busy commented on PIG-1386:
--

oops, posted to the wrong ticket:

{quote}
Hmm, okay, so let me shorten my problem. Basically, the functions getInitial(), 
getIntermed(), and getFinal() in my Algebraic class don't have access to the 
constructor parameters. The reason is this: in Java, the super() constructor 
can only be called as the very first thing that the deriving class's 
constructor does, so my UDFs have constructors that look like this:
{code}
public ExtremalTupleByNthField(String fieldIndexString, String order) {
    super();
    parameters = "('"+fieldIndexString+"','"+order+"')";
}
@Override
public String getInitial() {
    return HelperClass.class.getName()+parameters;
}
{code}
But the problem is that the EvalFunc() constructor initializes the EvalFunc 
returned by getInitial() to type check. When it does this, it finds that my 
getInitial() returns something incomplete because the parameters member 
variable hasn't been initialized yet. This is a pretty mundane problem with 
Java programs, and the way to fix it is what I've submitted in the patch: 
calling an overridden method in the super()'s constructor.

I mean, I don't see any other way to do this, but I'd be willing to work on 
another implementation if you can suggest one?


{quote}

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  *define customMaxTuple ExtremalTupleByNthField(n, 'min');*
  *G = group T by id;*
  *M = foreach T generate customMaxTuple(T);
  *
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-23 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  *define customMaxTuple ExtremalTupleByNthField(n, 'min');*
  *G = group T by id;*
  *M = foreach T generate customMaxTuple(T);
  *
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
   Sometimes I wonder... I mean, somebody went to the trouble of making a path
   called

   org.apache.pig.piggybank.grouping

   (where it seems like this code belongs), but didn't check any Java code
   into that package.


   Any comments about where to put this kind of utility class?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
   That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDFs in piggybank for these:

   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --sometimes you need it this way for some reason.


   OK. I'm posting my current code here; maybe later I'll make a patch (if
   such an implementation is acceptable, of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
    * Convert any sequence of fields to a bag of tuples with the specified
    * number of fields per tuple.<br>
    * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
    * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
    *
    * @author astepachev
    */
   public class ToBag extends EvalFunc<DataBag> {
       public BagFactory bagFactory;
       public TupleFactory tupleFactory;

       public ToBag() {
           bagFactory = BagFactory.getInstance();
           tupleFactory = TupleFactory.getInstance();
       }

       @Override
       public DataBag exec(Tuple input) throws IOException {
           if (input.isNull())
               return null;
           final DataBag bag = bagFactory.newDefaultBag();
           // The first field is the number of fields per output tuple.
           final Integer counter = (Integer) input.get(0);
           if (counter == null)
               return null;
           Tuple tuple = tupleFactory.newTuple();
           for (int i = 0; i < input.size() - 1; i++) {
               // Start a new tuple every `counter` fields.
               if (i % counter == 0) {
                   tuple = tupleFactory.newTuple();
                   bag.add(tuple);
               }
               tuple.append(input.get(i + 1));
           }
           return bag;
       }
   }
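
The grouping rule in ToBag (start a new tuple every `count` fields) can be tried out without a Pig installation. The following is a minimal sketch in plain Java, assuming lists in place of Pig's Tuple and DataBag; the class name ToBagSketch and the method chunk are hypothetical names introduced here, not part of the posted code. Unlike the posted exec method, the sketch rejects a non-positive count explicitly, since `i % 0` would otherwise throw an ArithmeticException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the ToBag UDF: plain lists replace Tuple and DataBag.
class ToBagSketch {
    // Splits fields into consecutive groups of `count`; the last group may be shorter.
    static List<List<Object>> chunk(int count, List<?> fields) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        List<List<Object>> bag = new ArrayList<>();
        List<Object> tuple = null;
        for (int i = 0; i < fields.size(); i++) {
            if (i % count == 0) { // start a new tuple every `count` fields
                tuple = new ArrayList<>();
                bag.add(tuple);
            }
            tuple.add(fields.get(i));
        }
        return bag;
    }

    public static void main(String[] args) {
        // prints [[f1, f2], [f3, f4], [f5]]
        System.out.println(chunk(2, Arrays.asList("f1", "f2", "f3", "f4", "f5")));
    }
}
```

Note that a trailing group shorter than `count` is kept rather than dropped, which matches the loop in the posted UDF.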
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-23 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

25ce97367cadfd2ea4be379c6f5c351d

Clean up documentation and refactor to unify parsing of constructor arguments 
in the two classes.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.

[jira] Commented: (PIG-1385) UDF to create tuples and bags

2010-04-23 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860365#action_12860365
 ] 

hc busy commented on PIG-1385:
--

ok, ok, moving tests to evaluation.util requires that you import the classes 
under test.

Here we usually have tests in the same package (but sitting under test/ instead 
of src/) so we can test package protected methods. Also so we don't have to 
import the CUT. But other than that, I guess I should follow convention. I 
agree with these changes.

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1385-trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h


[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch



[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

checked to be sure the unittest builds and runs.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch



[jira] Updated: (PIG-1385) UDF to create tuples and bags

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1385:
-

Attachment: (was: PIG-1385-trunk.patch)

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
   Original Estimate: 24h
  Remaining Estimate: 24h

 Based on this conversation:
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there a reason not to do the second or third, other than being more
   complicated?

   Certainly I'd volunteer to put the top implementation into the util
   package and submit it for builtins, but the latter syntactic sugar
   seems more natural.
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig allowed
   users to define grouping functions (0.1).  Functions like these should go in
   evaluation.util.

   However, I'd consider putting these in builtin (in main Pig) instead.
   These are things everyone asks for and they seem like a reasonable addition
   to the core engine.  This will be more of a burden to write (as we'll hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
   Sometimes I wonder... I mean, somebody went to the trouble of making a path
   called

   org.apache.pig.piggybank.grouping

   (where it seems like this code belongs), but didn't check any Java code
   into that package.


   Any comments about where to put this kind of utility class?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
   That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDFs in piggybank for these:

   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --sometimes you need it this way for some reason.


   OK. I'm posting my current code here; maybe later I'll make a patch (if
   such an implementation is acceptable, of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
    * Convert any sequence of fields to a bag of tuples with the specified
    * number of fields per tuple.<br>
    * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
    * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
    *
    * @author astepachev
    */
   public class ToBag extends EvalFunc<DataBag> {
       public BagFactory bagFactory;
       public TupleFactory tupleFactory;

       public ToBag() {
           bagFactory = BagFactory.getInstance();
           tupleFactory = TupleFactory.getInstance();
       }

       @Override
       public DataBag exec(Tuple input) throws IOException {
           if (input.isNull())
               return null;
           final DataBag bag = bagFactory.newDefaultBag();
           // The first field is the number of fields per output tuple.
           final Integer counter = (Integer) input.get(0);
           if (counter == null)
               return null;
           Tuple tuple = tupleFactory.newTuple();
           for (int i = 0; i < input.size() - 1; i++) {
               // Start a new tuple every `counter` fields.
               if (i % counter == 0) {
                   tuple = tupleFactory.newTuple();
                   bag.add(tuple);
               }
               tuple.append(input.get(i + 1));
           }
           return bag;
       }
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue;
  
   /**
   * @author astepachev
   */
   public class ToBagTest {
       PigServer pigServer;
       URL inputTxt;

       @Before
       public void init() throws IOException, URISyntaxException {
           pigServer = new PigServer(ExecType.LOCAL);
           inputTxt = this.getClass().getResource("bagTest.txt").toURI().toURL();
       }

       @Test
       public void testSimple() throws IOException {
           pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
                   "' using PigStorage(',') " +
                   "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray

[jira] Updated: (PIG-1385) UDF to create tuples and bags

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1385:
-

Status: Open  (was: Patch Available)

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1385-trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h


[jira] Updated: (PIG-1385) UDF to create tuples and bags

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1385:
-

Attachment: PIG-1385-trunk.patch

changed so that the unit test builds and runs.

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1385-trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Based on this conversation:
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there a reason not to do the second or third, other than being more
   complicated?

   Certainly I'd volunteer to put the top implementation into the util
   package and submit it for builtins, but the latter syntactic sugar
   seems more natural.
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig allowed
   users to define grouping functions (0.1).  Functions like these should go in
   evaluation.util.

   However, I'd consider putting these in builtin (in main Pig) instead.
   These are things everyone asks for and they seem like a reasonable addition
   to the core engine.  This will be more of a burden to write (as we'll hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to a bag with the specified count of
   * fields per tuple.<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer counter = (Integer) input.get(0);
if (counter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % counter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
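
The grouping logic in ToBag above can be tried without a Pig installation. What follows is only a minimal sketch in plain Java, assuming java.util lists stand in for Pig tuples and bags. ChunkSketch and chunk are hypothetical names for illustration and are not part of Pig or piggybank.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; lists stand in for Pig's Tuple and DataBag.
final class ChunkSketch {
    private ChunkSketch() {}

    /** Splits fields into consecutive groups of at most count elements. */
    static List<List<Object>> chunk(int count, List<?> fields) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        List<List<Object>> bag = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            // Start a new "tuple" every count fields, as ToBag does.
            if (i % count == 0) {
                bag.add(new ArrayList<>());
            }
            bag.get(bag.size() - 1).add(fields.get(i));
        }
        return bag;
    }
}
```

With count = 2 and the fields (id, a, id, b, id, c), this yields three two-element groups, which mirrors the "count=2" case described in the ToBag Javadoc. If the number of fields is not a multiple of count, the last group is simply shorter.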
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue;
  
   /**
   * @author astepachev
   */
   public class ToBagTest {
PigServer pigServer;
URL inputTxt;
  
@Before
public void init() throws IOException, URISyntaxException {
pigServer = new PigServer(ExecType.LOCAL);
inputTxt =
   this.getClass().getResource("bagTest.txt").toURI().toURL();
}
  
@Test
public void testSimple() throws IOException {
pigServer.registerQuery("a = load '" + inputTxt.toExternalForm()
  +
   "' using PigStorage

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Status: Open  (was: Patch Available)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
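
For illustration only, here is a rough sketch of what the proposed ExtremalTupleByNthField might compute for a single group, assuming plain Java lists in place of Pig tuples and bags. ExtremalSketch, extremal, the 1-based field index and the "min"/"max" mode strings are all assumptions made for this sketch; the attached patch may well differ.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-alone sketch; not the code from PIG-1386-trunk.patch.
final class ExtremalSketch {
    private ExtremalSketch() {}

    /**
     * Returns the tuple whose n-th field (1-based, an assumption here) is
     * smallest or largest, or an empty Optional for an empty bag.
     */
    static <T extends Comparable<? super T>> Optional<List<T>> extremal(
            List<List<T>> bag, int n, String mode) {
        // Compare two tuples by their n-th field only.
        Comparator<List<T>> byNth = (a, b) -> a.get(n - 1).compareTo(b.get(n - 1));
        switch (mode) {
            case "min":
                return bag.stream().min(byNth);
            case "max":
                return bag.stream().max(byNth);
            default:
                throw new IllegalArgumentException("unsupported mode: " + mode);
        }
    }
}
```

Passing the field index and the mode as parameters plays the role that constructor arguments would play in the define statement above. Median and other modes mentioned in the proposal are left out of this sketch.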
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to a bag with the specified count of
   * fields per tuple.<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer counter = (Integer) input.get(0);
if (counter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % counter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Status: Patch Available  (was: Open)

resubmitting patch for the build system to check.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to a bag with the specified count of
   * fields per tuple.<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer counter = (Integer) input.get(0);
if (counter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % counter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import

[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor

2010-04-22 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860103#action_12860103
 ] 

hc busy commented on PIG-1303:
--

Okay, so, here's a thought:

I'm kind of stuck writing the initial/intermed/Final methods for an algebraic 
EvalFunc that has constructor parameters because I couldn't pass the parameters 
in.


A suggestion is to do this (without being incompatible with previous versions)

Alter EvalFunc's profile so that

{code}
public abstract class EvalFunc<T> {

    // Hook for subclasses; called before the rest of the constructor runs.
    protected void handleChildConstructorParameters(Object... childConstructor) {
        // by default do nothing.
    }

    public EvalFunc(Object... constructorParameters) {
        handleChildConstructorParameters(constructorParameters);
        // ... then do everything else it used to do.
    }
}
{code}


The reason this is necessary is that I'll need to override 
handleChildConstructorParameters in my Algebraic EvalFunc to do some things 
before the rest of EvalFunc()'s constructor continues. This will help fix this 
date format problem for Algebraic EvalFuncs.
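
To make the suggestion concrete, here is a small self-contained sketch of the hook pattern. It is not Pig code: BaseFunc stands in for EvalFunc, and DateFormatFunc, its format field and the default pattern string are hypothetical names chosen for illustration. The sketch also shows one thing to watch for: the hook runs before the subclass's own field initializers, so a field assigned in the hook must not also have an initializer.

```java
// Hypothetical stand-in for EvalFunc; names are illustrative only.
abstract class BaseFunc<T> {
    protected BaseFunc(Object... constructorParameters) {
        // Runs before any subclass field initializer or constructor body.
        handleChildConstructorParameters(constructorParameters);
    }

    protected void handleChildConstructorParameters(Object... params) {
        // by default do nothing.
    }

    abstract T exec(String input);
}

final class DateFormatFunc extends BaseFunc<String> {
    // Deliberately no initializer: "private String format = ..." would run
    // after super(...) returns and overwrite the value set by the hook.
    private String format;

    DateFormatFunc(Object... params) {
        super(params);
        if (format == null) {
            format = "yyyy-MM-dd"; // assumed default for this sketch
        }
    }

    @Override
    protected void handleChildConstructorParameters(Object... params) {
        if (params.length > 0) {
            format = String.valueOf(params[0]);
        }
    }

    @Override
    String exec(String input) {
        // Placeholder behaviour: just show which format is in effect.
        return format + ":" + input;
    }
}
```

Constructing DateFormatFunc with one argument keeps that argument as the format, while the no-argument form falls back to the default. Whether Pig would actually pass the arguments through when it re-instantiates the function is a separate question that this sketch does not address.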




 unable to set outgoing format for 
 org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
 

 Key: PIG-1303
 URL: https://issues.apache.org/jira/browse/PIG-1303
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: pig 0.6.0 on a fedora linux machine, jdk 1.6 u11
Reporter: Johannes Rußek
Assignee: Dmitriy V. Ryaboy
 Attachments: TypeCheckingVisitor.java.diff


 I'm unable to set the format of the outgoing date string in the constructor 
 as it's supposed to work. 
 The only way i could change the format was to change the default in the java 
 class and rebuild piggybank.
 Apparently this has something to do with the way pig instantiates 
 DateExtractor, quoting a replier on the mailing list:
 David Vrensk said:
 I ran into the same problem a couple of weeks ago, and
 played around with the code inserting some print/log statements.  It turns
 out that the arguments are only used in the initial constructor calls, when
 the pig process is starting, but once pig reaches the point where it would
 use the udf, it creates new DateExtractors without passing the arguments.




[jira] Commented: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860104#action_12860104
 ] 

hc busy commented on PIG-1386:
--

I'm having trouble writing this UDF because of the bug similar to PIG-1303; 
Here's my comment to that ticket below. It seems that by doing this, it allows 
me to pass on the constructor parameters:
{quote}
Okay, so, here's a thought:
I'm kind of stuck writing the initial/intermed/Final methods for an algebraic 
EvalFunc that has constructor parameters because I couldn't pass the parameters 
in.

A suggestion is to do this (without being incompatible with previous versions)

Alter EvalFunc's profile so that
{code}
public abstract class EvalFunc<T> {

    // Hook for subclasses; called before the rest of the constructor runs.
    protected void handleChildConstructorParameters(Object... childConstructor) {
        // by default do nothing.
    }

    public EvalFunc(Object... constructorParameters) {
        handleChildConstructorParameters(constructorParameters);
        // ... then do everything else it used to do.
    }
}
{code}
The reason this is necessary is that I'll need to override 
handleChildConstructorParameters in my Algebraic EvalFunc to do some things 
before the rest of EvalFunc()'s constructor continues. This will help fix this 
date format problem for Algebraic EvalFuncs.
{quote}


 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to a bag with the specified count of
   * fields per tuple.<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

a92218b0c641363439af8f2d9e5ecbc0

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to a bag with the specified count of
   * fields per tuple.<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer counter = (Integer) input.get(0);
if (counter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % counter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-22 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: (was: PIG-1386-trunk.patch)

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to a bag with the specified count of
   * fields per tuple.<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer counter = (Integer) input.get(0);
if (counter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % counter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL

[jira] Updated: (PIG-1385) UDF to create tuples and bags

2010-04-20 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1385:
-

Affects Version/s: 0.6.0
  Description: 
Based on this conversation:

 On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:

  What about making them part of the language using symbols?
 
  instead of
 
  foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
 
  have language support
 
  foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
 
  or even:
 
  foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
 
 
  Is there reason not to do the second or third other than being more
  complicated?
 
  Certainly I'd volunteer to put the top implementation in to the util
  package and submit them for builtin's, but the latter syntactic candies
  seems more natural..
 
 
 
  On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
 
  The grouping package in piggybank is left over from back when Pig
 allowed
  users to define grouping functions (0.1).  Functions like these should
 go in
  evaluation.util.
 
  However, I'd consider putting these in builtin (in main Pig) instead.
   These are things everyone asks for and they seem like a reasonable
 addition
  to the core engine.  This will be more of a burden to write (as we'll
 hold
  them to a higher standard) but of more use to people as well.
 
  Alan.
 
 
  On Apr 19, 2010, at 12:53 PM, hc busy wrote:
 
   Some times I wonder... I mean, somebody went to the trouble of making a
  path
  called
 
  org.apache.pig.piggybank.grouping
 
  (where it seems like this code belong), but didn't check in any java
 code
  into that package.
 
 
  Any comment about where to put this kind of utility classes?
 
 
 
  On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
 
   2010/4/19 hc busy hc.b...@gmail.com
 
   That's just the way it is right now, you can't make bags or tuples
  directly... Maybe we should have some UDF's in piggybank for these:
 
  toBag()
  toTuple(); --which is kinda like exec(Tuple in){return in;}
  TupleToBag(); --some times you need it this way for some reason.
 
 
   Ok. I place my current code here, may be later I make a patch (if
 such
  implementation is acceptable of course).
 
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.BagFactory;
  import org.apache.pig.data.DataBag;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.data.TupleFactory;
 
  import java.io.IOException;
 
  /**
  * Convert any sequence of fields to a bag with the specified count of
  * fields per tuple.<br>
  * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
  * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
  *
  * @author astepachev
  */
  public class ToBag extends EvalFunc<DataBag> {
   public BagFactory bagFactory;
   public TupleFactory tupleFactory;
 
   public ToBag() {
   bagFactory = BagFactory.getInstance();
   tupleFactory = TupleFactory.getInstance();
   }
 
   @Override
   public DataBag exec(Tuple input) throws IOException {
   if (input.isNull())
   return null;
   final DataBag bag = bagFactory.newDefaultBag();
   final Integer couter = (Integer) input.get(0);
   if (couter == null)
   return null;
   Tuple tuple = tupleFactory.newTuple();
   for (int i = 0; i < input.size() - 1; i++) {
   if (i % couter == 0) {
   tuple = tupleFactory.newTuple();
   bag.add(tuple);
   }
   tuple.append(input.get(i + 1));
   }
   return bag;
   }
  }
 
  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;
  import org.junit.Before;
  import org.junit.Test;
 
  import java.io.IOException;
  import java.net.URISyntaxException;
  import java.net.URL;
 
  import static org.junit.Assert.assertTrue;
 
  /**
  * @author astepachev
  */
  public class ToBagTest {
   PigServer pigServer;
   URL inputTxt;
 
   @Before
   public void init() throws IOException, URISyntaxException {
   pigServer = new PigServer(ExecType.LOCAL);
   inputTxt =
  this.getClass().getResource("bagTest.txt").toURI().toURL();
   }
 
   @Test
   public void testSimple() throws IOException {
   pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
  "' using PigStorage(',') " +
   "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray);");
   pigServer.registerQuery("last = foreach a generate flatten(" +
  ToBag.class.getName() + "(2, id, a, id, b, id, c));");
 
   pigServer.deleteFile("target/pigtest/func1.txt");
   pigServer.store("last", "target/pigtest/func1.txt");
   assertTrue(pigServer.fileSize("target/pigtest/func1.txt") > 0);
   }
  }
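
For anyone reading the ToBag code above without a Pig build at hand, here is a minimal plain-Java sketch of the same chunking loop. It is only an illustration: the class name ToBagChunkSketch and the method chunk are made up for this note, java.util lists stand in for Pig's Tuple and DataBag, and the guard against a non-positive count is an added assumption that the quoted UDF does not have.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the chunking loop in the quoted ToBag UDF.
// Lists stand in for Pig's Tuple and DataBag; no Pig classes are used.
class ToBagChunkSketch {

    // fields.get(0) is the chunk size; the remaining fields are split into
    // consecutive groups of that size (the last group may be shorter).
    static List<List<Object>> chunk(List<Object> fields) {
        // Returns null for a missing input or a null count, as the UDF does.
        // Treating an empty input the same way is an assumption of this sketch.
        if (fields == null || fields.isEmpty() || fields.get(0) == null) {
            return null;
        }
        int count = (Integer) fields.get(0);
        if (count <= 0) {
            // Assumption: the quoted UDF has no such guard.
            throw new IllegalArgumentException("count must be positive");
        }
        List<List<Object>> bag = new ArrayList<>();
        List<Object> tuple = null;
        for (int i = 0; i < fields.size() - 1; i++) {
            if (i % count == 0) {
                // Start a new group every `count` fields.
                tuple = new ArrayList<>();
                bag.add(tuple);
            }
            tuple.add(fields.get(i + 1));
        }
        return bag;
    }

    public static void main(String[] args) {
        // Same argument list as the test query: (2, id, a, id, b, id, c)
        System.out.println(chunk(List.of(2, "id", "a", "id", "b", "id", "c")));
        // prints [[id, a], [id, b], [id, c]]
    }
}
```

With a count of 2 and six trailing fields, the result is three two-field groups, which is what the javadoc's "count=2" example describes.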
 
 
 
 



  was:
Based on this conversation:

totally, go for it, it'd be pretty straightforward to add this
functionality.



On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:

 Hey, while we're

[jira] Created: (PIG-1387) Syntactical Sugar for PIG-1385

2010-04-20 Thread hc busy (JIRA)
Syntactical Sugar for PIG-1385
--

 Key: PIG-1387
 URL: https://issues.apache.org/jira/browse/PIG-1387
 Project: Pig
  Issue Type: Wish
  Components: grunt
Affects Versions: 0.6.0
Reporter: hc busy


Following this conversation, extend PIG-1385 so that, instead of calling a UDF,
Pig uses built-in behavior when the (), {}, and [] groupings are encountered.


  What about making them part of the language using symbols?
 
  instead of
 
  foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
 
  have language support
 
  foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
 
  or even:
 
  foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
 
 
  Is there reason not to do the second or third other than being more
  complicated?
 
  Certainly I'd volunteer to put the top implementation in to the util
  package and submit them for builtin's, but the latter syntactic candies
  seems more natural..
 
 
 
  On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
 
  The grouping package in piggybank is left over from back when Pig
 allowed
  users to define grouping functions (0.1).  Functions like these should
 go in
  evaluation.util.
 
  However, I'd consider putting these in builtin (in main Pig) instead.
   These are things everyone asks for and they seem like a reasonable
 addition
  to the core engine.  This will be more of a burden to write (as we'll
 hold
  them to a higher standard) but of more use to people as well.
 
  Alan.
 
 
  On Apr 19, 2010, at 12:53 PM, hc busy wrote:
 
   Some times I wonder... I mean, somebody went to the trouble of making a
  path
  called
 
  org.apache.pig.piggybank.grouping
 
  (where it seems like this code belong), but didn't check in any java
 code
  into that package.
 
 
  Any comment about where to put this kind of utility classes?
 
 
 
  On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
 
   2010/4/19 hc busy hc.b...@gmail.com
 
   That's just the way it is right now, you can't make bags or tuples
  directly... Maybe we should have some UDF's in piggybank for these:
 
  toBag()
  toTuple(); --which is kinda like exec(Tuple in){return in;}
  TupleToBag(); --some times you need it this way for some reason.
 
 
   Ok. I place my current code here, may be later I make a patch (if
 such
  implementation is acceptable of course).
 
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.BagFactory;
  import org.apache.pig.data.DataBag;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.data.TupleFactory;
 
  import java.io.IOException;
 
  /**
  * Convert any sequence of fields to bag with specified count of
  fields<br>
  * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
  * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
  *
  * @author astepachev
  */
  public class ToBag extends EvalFunc<DataBag> {
   public BagFactory bagFactory;
   public TupleFactory tupleFactory;
 
   public ToBag() {
   bagFactory = BagFactory.getInstance();
   tupleFactory = TupleFactory.getInstance();
   }
 
   @Override
   public DataBag exec(Tuple input) throws IOException {
   if (input.isNull())
   return null;
   final DataBag bag = bagFactory.newDefaultBag();
   final Integer couter = (Integer) input.get(0);
   if (couter == null)
   return null;
   Tuple tuple = tupleFactory.newTuple();
   for (int i = 0; i < input.size() - 1; i++) {
   if (i % couter == 0) {
   tuple = tupleFactory.newTuple();
   bag.add(tuple);
   }
   tuple.append(input.get(i + 1));
   }
   return bag;
   }
  }
 
  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;
  import org.junit.Before;
  import org.junit.Test;
 
  import java.io.IOException;
  import java.net.URISyntaxException;
  import java.net.URL;
 
  import static org.junit.Assert.assertTrue;
 
  /**
  * @author astepachev
  */
  public class ToBagTest {
   PigServer pigServer;
   URL inputTxt;
 
   @Before
   public void init() throws IOException, URISyntaxException {
   pigServer = new PigServer(ExecType.LOCAL);
   inputTxt =
  this.getClass().getResource("bagTest.txt").toURI().toURL();
   }
 
   @Test
   public void testSimple() throws IOException {
   pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
  "' using PigStorage(',') " +
   "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray);");
   pigServer.registerQuery("last = foreach a generate flatten(" +
  ToBag.class.getName() + "(2, id, a, id, b, id, c));");
 
   pigServer.deleteFile("target/pigtest/func1.txt");
   pigServer.store("last", "target/pigtest/func1.txt");
   assertTrue(pigServer.fileSize("target/pigtest/func1.txt") > 0);
   }
  }
 
 
 
 




[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-20 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Status: Patch Available  (was: Open)

Here's a first stab.

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy

 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
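
To make the proposal concrete, here is a rough plain-Java sketch of what "pick the extremal tuple by the nth field" could mean. Everything in it is hypothetical: the class ExtremalByNthFieldSketch, the method extremal, the 0-based field index, and the restriction to integer fields and the 'min'/'max' modes are assumptions for illustration, not the patch attached to this issue.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch: choose the row of a bag whose n-th field is extremal.
// Rows are lists of integers; no Pig classes are used.
class ExtremalByNthFieldSketch {

    // n is assumed to be 0-based, like Pig's positional $0, $1, ...
    static List<Integer> extremal(List<List<Integer>> bag, int n, String mode) {
        if (bag.isEmpty()) {
            return null; // nothing to choose from
        }
        Comparator<List<Integer>> byNth = Comparator.comparing(row -> row.get(n));
        switch (mode) {
            case "min":
                return Collections.min(bag, byNth);
            case "max":
                return Collections.max(bag, byNth);
            default:
                // 'median' and other modes from the proposal are left out here.
                throw new IllegalArgumentException("unsupported mode: " + mode);
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> bag = List.of(
                List.of(1, 10, 3),
                List.of(1, 7, 9),
                List.of(1, 12, 1));
        System.out.println(extremal(bag, 1, "min")); // prints [1, 7, 9]
    }
}
```

Passing the field index and the mode as parameters is the point of the proposal: one function covers what would otherwise need a separate UDF per field and per direction.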
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue

[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-04-20 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1386:
-

Attachment: PIG-1386-trunk.patch

The patch

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  define customMaxTuple ExtremalTupleByNthField(n, 'min');
  G = group T by id;
  M = foreach G generate customMaxTuple(T);
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static

[jira] Updated: (PIG-1385) UDF to create tuples and bags

2010-04-20 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1385:
-

Attachment: PIG-1385-trunk.patch

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
 Attachments: PIG-1385-trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Based on this conversation:
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue;
  
   /**
   * @author astepachev
   */
   public class ToBagTest {
PigServer pigServer;
URL inputTxt;
  
@Before
public void init() throws IOException, URISyntaxException {
pigServer = new PigServer(ExecType.LOCAL);
inputTxt =
   this.getClass().getResource("bagTest.txt").toURI().toURL();
}
  
@Test
public void testSimple() throws IOException {
pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
   "' using PigStorage(',') " +
"as (id:int, a:chararray, b:chararray, c:chararray, d:chararray

[jira] Updated: (PIG-1385) UDF to create tuples and bags

2010-04-20 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1385:
-

Status: Patch Available  (was: Open)

 UDF to create tuples and bags
 -

 Key: PIG-1385
 URL: https://issues.apache.org/jira/browse/PIG-1385
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
 Attachments: PIG-1385-trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Based on this conversation:
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
   * Convert any sequence of fields to bag with specified count of
   fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
   public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue;
  
   /**
   * @author astepachev
   */
   public class ToBagTest {
PigServer pigServer;
URL inputTxt;
  
@Before
public void init() throws IOException, URISyntaxException {
pigServer = new PigServer(ExecType.LOCAL);
inputTxt =
   this.getClass().getResource("bagTest.txt").toURI().toURL();
}
  
@Test
public void testSimple() throws IOException {
pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
   "' using PigStorage(',') " +
"as (id:int, a:chararray, b:chararray, c:chararray, d:chararray

Re: incorrect Inner Join result for multi column join with null values in join key

2010-04-16 Thread hc busy
Cool! can't wait until CDH has 0.7...

Kinda surprised that nobody encountered this problem before... Can I file a
ticket?

On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates ga...@yahoo-inc.com wrote:


 On Apr 16, 2010, at 9:37 AM, hc busy wrote:

  What scott noticed is present when the multiple column join key is used in
 a
 distributed setting. The trap is that when you unit test the behavior/PIG
 script and it does the join right in a local environment and then you get
 F'ed after u deploy to production in distributed enviro.


 In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully that will
 fix these issues with development and deployment differences.

 Alan.




 On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey sc...@richrelevance.com
 wrote:

  CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2
 weeks
 ago.

 Joins on multiple columns have null key values matching.

 IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
 f1:int, f2:int);
 IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
 f1:int, f2:int);
 --- both the above are the same

 dump IN;
 (,1,2,3)
 (,-5,5,5)
 ( ,100,200,300)
 (  ,0,200,300)
 (a,4,5,6)
 (a,7,8,9)
 (b,10,11,12)
 (b,11,11,12)

 IN_NULLS = FILTER IN BY ind is NULL;
 dump IN_NULLS;
 (,1,2,3)
 (,-5,5,5)

 J1 = JOIN IN by (ind), IN2 by (ind);
 dump J1;
 (  ,0,200,300,  ,0,200,300)
 (a,4,5,6,a,4,5,6)
 (a,4,5,6,a,7,8,9)
 (a,7,8,9,a,4,5,6)
 (a,7,8,9,a,7,8,9)
 ( ,100,200,300, ,100,200,300)
 (b,10,11,12,b,10,11,12)
 (b,10,11,12,b,11,11,12)
 (b,11,11,12,b,10,11,12)
 (b,11,11,12,b,11,11,12)

 The above is the expected result of the self-join on the first column.

 J2 = JOIN IN by (ind, ts), IN2 by (ind, ts);
 dump J2;
 (  ,0,200,300,  ,0,200,300)
 ( ,100,200,300, ,100,200,300)
 (a,4,5,6,a,4,5,6)
 (a,7,8,9,a,7,8,9)
 (b,10,11,12,b,10,11,12)
 (b,11,11,12,b,11,11,12)
 (,-5,5,5,,-5,5,5)
 (,1,2,3,,1,2,3)


 The above is incorrect, since it matched the rows that have NULL for the
 ind field.

 There is a work-around, by explicitly filtering for null on the join
 columns before the join, but the above still looks incorrect to me.
 I suspect it is fixed in 0.6 or later, but I have not been able to find a
 JIRA ticket or message on this list about this.
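
 The behavior expected here (a null in any join column never matches, not even another null) can be written down as a small plain-Java sketch. It is an illustration only: NullAwareJoinSketch, keysMatch and join are hypothetical names, rows are plain Object arrays, and the nested-loop join says nothing about how Pig actually executes joins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of an inner join on a composite key where a null in any
// key column prevents a match. Rows are Object[] of the form (ind, ts, f1, f2).
class NullAwareJoinSketch {

    // True only if every key column is non-null on both sides and equal.
    static boolean keysMatch(Object[] left, Object[] right, int... keyCols) {
        for (int c : keyCols) {
            if (left[c] == null || right[c] == null) {
                return false; // null never equals anything, including null
            }
            if (!left[c].equals(right[c])) {
                return false;
            }
        }
        return true;
    }

    // Naive nested-loop inner join; output rows are left fields then right fields.
    static List<Object[]> join(List<Object[]> a, List<Object[]> b, int... keyCols) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] l : a) {
            for (Object[] r : b) {
                if (keysMatch(l, r, keyCols)) {
                    Object[] row = Arrays.copyOf(l, l.length + r.length);
                    System.arraycopy(r, 0, row, l.length, r.length);
                    out.add(row);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> in = List.of(
                new Object[] {null, 1, 2, 3},
                new Object[] {null, -5, 5, 5},
                new Object[] {"a", 4, 5, 6},
                new Object[] {"a", 7, 8, 9});
        // Self-join on (ind, ts): only the two "a" rows match themselves.
        System.out.println(join(in, in, 0, 1).size()); // prints 2
    }
}
```

 Under this rule the two rows with a null ind drop out of the (ind, ts) join, which is the result the filter-before-join work-around produces.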








Re: Begin a discussion about Pig as a top level project

2010-04-05 Thread hc busy
The Twitter office is cushier and has more bars within stumbling
distance. Just sayin'.

and strip clubs too, I gather there're a couple on Market... near civic bart
stop ;-)

oh... hey, you guys are at a nice place... lots of night clubs near there
too.


 Given that, do you think it makes sense to say that Pig stays a
subproject for now, but if it someday grows beyond Hadoop only it becomes a
TLP?  I could agree to that stance.


Oops, I didn't read your whole message... I think TLP could be part of the
roadmap: Planned publicity, like planned pregnancy, is a good thing.

And on the way there, we should add dedicated resource that updates
documentation and links on the website... :-)




On Mon, Apr 5, 2010 at 12:10 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 The Twitter office is cushier and has more bars within stumbling distance.
 Just sayin'.

 To the subject at hand -- I don't think TLP standing has the PR value you
 think it does... feature set, velocity of development, adoption,
 flexibility, etc -- those are far more important.

 -Dmitriy

 On Mon, Apr 5, 2010 at 11:58 AM, hc busy hc.b...@gmail.com wrote:

   Of course I'd love it if someday there is an ISO Pig Latin committee
  (with
  meetings in cool exotic places) deciding the official standard for Pig
  Latin.
 
  haha!!! Some exotic place like Yahoo's  HQ in sunny Sunnyvale California?
 
  I guess it feels like it depends on the roadmap more than roadmap depends
  on
  it. In terms of positioning, a TLP would appear to potential users who
 are
  evaluating alternatives to consider it as _the_ choice as opposed to one
 of
  the choices. If the ambition is to take it there, then TLP, as useless as
  it
  may seem right now, might actually be worth the effort to attain.
 
  I mean, would you rather wait until Hive makes TLP and then play catch
 up?
  I
  mean, I can kinda see them doing that...
 
 
 
 
  On Mon, Apr 5, 2010 at 11:36 AM, Alan Gates ga...@yahoo-inc.com wrote:
 
   Prognostication is a difficult business.  Of course I'd love it if
  someday
   there is an ISO Pig Latin committee (with meetings in cool exotic
 places)
   deciding the official standard for Pig Latin.  But that seems like
 saying
  in
   your start up's business plan, "When we reach Google's size, then we'll
  do
   x."  If there ever is an ISO Pig Latin standard it will be years off.
  
   As others have noted, staying tight to Hadoop now has many advantages,
  both
   in technical and adoption terms.  Hence my advocacy of keeping Pig
 Latin
   Hadoop agnostic while tightly integrating the backend.  Which is to say
  that
   in my view, Pig is Hadoop specific now, but there may come a day when
  that
   is no longer true.   Whether Pig will ever move past just running on
  Hadoop
   to running in other parallel systems won't be known for years to come.
Given that, do you think it makes sense to say that Pig stays a
  subproject
   for now, but if it someday grows beyond Hadoop only it becomes a TLP?
  I
   could agree to that stance.
  
   Alan.
  
  
   On Apr 3, 2010, at 12:43 PM, Santhosh Srinivasan wrote:
  
I see this as a multi-part question. Looking back at some of the
   significant roadmap/existential questions asked in the last 12 months,
 I
   see the following:
  
   1. With the introduction of SQL, what is the philosophy of Pig (I sent
   an email about this approximately 9 months ago)
   2. What is the approach to support backward compatibility in Pig (Alan
   had sent an email about this 3 months ago)
   3. Should Pig be a TLP (the current email thread).
  
   Here is my take on answering the aforementioned questions.
  
   The initial philosophy of Pig was to be backend agnostic. It was
   designed as a data flow language. Whenever a new language is designed,
   the syntax and semantics of the language have to be laid out. The
 syntax
   is usually captured in the form of a BNF grammar. The semantics are
   defined by the language creators. Backward compatibility is then a
   question of holding true to the syntax and semantics. With Pig, in
   addition to the language, the Java APIs were exposed to customers to
   implement UDFs (load/store/filter/grouping/row transformation etc),
   provision looping since the language does not support looping constructs
   and also support a programmatic mode of access. Backward compatibility
   in this context is to support API versioning.
  
   Do we still intend to position as a data flow language that is backend
   agnostic? If the answer is yes, then there is a strong case for making
   Pig a TLP.
  
   Are we influenced by Hadoop? A big YES! The reason Pig chose to become a
   Hadoop sub-project was to ride the Hadoop popularity wave. As a
   consequence, we chose to be heavily influenced by the Hadoop roadmap.
  
   Like a good lawyer, I also have rebuttals to Alan's questions :)
  
   1. Search engine popularity - We can discuss this with the Hadoop team
   and still retain links to TLP's

What should FLATTEN do?

2010-04-02 Thread hc busy
Guys, I have a row containing a map

'id','data', {((1,2)), ((2,3)), ((4,5))}

What is the expected behavior when I flatten on that bag? I had expected it
to result in

'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)


But it appears to me that the result of applying FLATTEN to that bag is this
instead:

'id','data', 1,2
'id','data', 2,3
'id','data', 4,5


The latter is returned by the current cloudera's CDH2 and I've seen the
former behavior on other versions of pig.

Which is the correct behavior by design?

What will pig 0.6 do when it is released?

thanks!
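
To make the two readings concrete, here is a minimal sketch in Python (not Pig itself, and not Pig's implementation). It models a Pig tuple as a Python tuple and a bag as a list. The helper names flatten_bag and flatten_deep are made up for illustration: the first un-nests the bag one level, which is the result expected above, and the second also splices inner tuples, which is the result reported from the CDH2 install.

```python
# Illustrative model only: a Pig tuple is a Python tuple, a Pig bag is a list.

def flatten_bag(row, bag_index):
    """Un-nest the bag at bag_index by one level: emit one output row per
    tuple in the bag, splicing that tuple's fields into the row."""
    prefix = row[:bag_index]
    suffix = row[bag_index + 1:]
    return [prefix + inner + suffix for inner in row[bag_index]]


def flatten_deep(row, bag_index):
    """Like flatten_bag, but any field that is itself a tuple is spliced
    into the row as well (the behaviour reported in the message above)."""
    out = []
    for inner in row[bag_index]:
        fields = []
        for f in inner:
            if isinstance(f, tuple):
                fields.extend(f)
            else:
                fields.append(f)
        out.append(row[:bag_index] + tuple(fields) + row[bag_index + 1:])
    return out


# The row from the question: a bag whose tuples each hold one inner tuple.
row = ("id", "data", [((1, 2),), ((2, 3),), ((4, 5),)])
one_level = flatten_bag(row, 2)    # rows keep (1, 2) as a single field
all_levels = flatten_deep(row, 2)  # rows end in the bare fields 1, 2
```

Which of the two a given Pig release actually implements is exactly the open question in this thread; the sketch only pins down what each answer would mean.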


Re: What should FLATTEN do?

2010-04-02 Thread hc busy
doh s/map/bag/g

I seem to get maps and bags mixed up for some reason...

Guys, I have a row containing a *bag*

'id','data', {((1,2)), ((2,3)), ((4,5))}

What is the expected behavior when I flatten on that bag? I had expected it
to result in

'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)


But it appears to me that the result of applying FLATTEN to that bag is this
instead:

'id','data', 1,2
'id','data', 2,3
'id','data', 4,5


The latter is returned by the current cloudera's CDH2 and I've seen the
former behavior on other versions of pig.

Which is the correct behavior by design?

What will pig 0.6 do when it is released?

thanks!
On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote:

 Guys, I have a row containing a map

 'id','data', {((1,2)), ((2,3)), ((4,5))}

 What is the expected behavior when I flatten on that bag? I had expected it
 to result in

 'id','data', (1,2)
 'id','data', (2,3)
 'id','data', (4,5)


 But it appears to me that the result of applying FLATTEN to that bag is
 this instead:

 'id','data', 1,2
 'id','data', 2,3
 'id','data', 4,5


 The latter is returned by the current cloudera's CDH2 and I've seen the
 prior behavior on other versions of pig.

 Which is the correct behavior by design?

 What will pig 0.6 do when it is released?

 thanks!



Re: What should FLATTEN do?

2010-04-02 Thread hc busy
Yeah, I'm sure it has nested tuples. Pig doesn't natively support
introduction of tuples

h = foreach g generate ((x,y,z)), (x), x

doesn't work, but I have a udf that does that (don't ask why), and
I've seen it print a double pair of parens when I took a dump.

Our hadoop guys here say it's CDH2 and that the upgrade was just a
re-installation of CDH2... (same jars) But certainly my script suddenly
started doing weird things when it flattened that all the way through.

I'd support the prior behavior as well, because that seems to match my
reading of documentation on behavior of FLATTEN.



Has anybody else had this problem with recent cloudera/pig versions?


thnx!!


On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com wrote:

 Stupid question but are you sure your bag has the dual sets of parentheses?
 (And if I may ask, why is that the case?)

 On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com
 wrote:

  If I'm not mistaken, the output is the expected behavior. Flatten should
  unnest bags. I'm assuming your statement is something like FOREACH ...
  GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first
 two
  fields of a tuple for every tuple in the nested bag.
 
 
 
 
  On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote:
 
  doh s/map/bag/g
 
  I seem to get maps and bags mixed up or some reason...
 
  Guys, I have a row containing a *bag*
 
  'id','data', {((1,2)), ((2,3)), ((4,5))}
 
  What is the expected behavior when I flatten on that bag? I had expected
  it
  to result in
 
  'id','data', (1,2)
  'id','data', (2,3)
  'id','data', (4,5)
 
 
  But it appears to me that the result of applying FLATTEN to that bag is
  this
  instead:
 
  'id','data', 1,2
  'id','data', 2,3
  'id','data', 4,5
 
 
  The latter is returned by the current cloudera's CDH2 and I've seen the
  prior behavior on other versions of pig.
 
  Which is the correct behavior by design?
 
  What will pig 0.6 do when it is released?
 
  thanks!
  On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote:
 
   Guys, I have a row containing a map
  
   'id','data', {((1,2)), ((2,3)), ((4,5))}
  
   What is the expected behavior when I flatten on that bag? I had
 expected
  it
   to result in
  
   'id','data', (1,2)
   'id','data', (2,3)
   'id','data', (4,5)
  
  
   But it appears to me that the result of applying FLATTEN to that bag
 is
   this instead:
  
   'id','data', 1,2
   'id','data', 2,3
   'id','data', 4,5
  
  
   The latter is returned by the current cloudera's CDH2 and I've seen
 the
   prior behavior on other versions of pig.
  
   Which is the correct behavior by design?
  
   What will pig 0.6 do when it is released?
  
   thanks!
  
 
 
 
 
  --
  Zaki Rahaman
 
 


 --
 Zaki Rahaman



Re: What should FLATTEN do?

2010-04-02 Thread hc busy
Yeah, you have to implement the outputSchema() method on the udf in order
to make the content of the tuple visible... There's a nice example in the
UDF Manual:

http://hadoop.apache.org/pig/docs/r0.6.0/udf.html

Search for 'package myudf' until you find it.



On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney russell.jur...@gmail.com wrote:

 Not sure if this is exactly the same, but when I've created tuples within
 tuples in UDFs (to preserve order of pairs), from bag input, Pig has
 allowed
 it - but I can't work with that data in subsequent steps.

 On Fri, Apr 2, 2010 at 12:37 PM, hc busy hc.b...@gmail.com wrote:

  Yeah, I'm sure it has nested tuples. Pig doesn't natively support
  introduction of tuples
 
  h = foreach g generate ((x,y,z)), (x), x
 
  doesn't work, but i have a udf that does that don't ask why, and
  I've seen it print double pair of paren's when I took a dump.
 
  Our hadoop guys here says it's CDH2 and that the upgrade was just
  re-installation of CDH2... (same jars) But certainly my script suddenly
  started doing weird things when it flattened that all the way through.
 
  I'd support the prior behavior as well, because that seems to match my
  reading of documentation on behavior of FLATTEN.
 
 
 
  Has anybody else had this problem with recent cloudera/pig versions?
 
 
  thnx!!
 
 
  On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com
  wrote:
 
   Stupid question but are you sure your bag has the dual sets of
  parentheses?
   (And if I may ask, why is that the case?)
  
   On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com
   wrote:
  
If I'm not mistaken, the output is the expected behavior. Flatten
  should
unnest bags. I'm assuming your statement is something like FOREACH
 ...
GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the
  first
   two
fields of a tuple for every tuple in the nested bag.
   
   
   
   
On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote:
   
doh s/map/bag/g
   
I seem to get maps and bags mixed up or some reason...
   
Guys, I have a row containing a *bag*
   
'id','data', {((1,2)), ((2,3)), ((4,5))}
   
What is the expected behavior when I flatten on that bag? I had
  expected
it
to result in
   
'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)
   
   
But it appears to me that the result of applying FLATTEN to that bag
  is
this
instead:
   
'id','data', 1,2
'id','data', 2,3
'id','data', 4,5
   
   
The latter is returned by the current cloudera's CDH2 and I've seen
  the
prior behavior on other versions of pig.
   
Which is the correct behavior by design?
   
What will pig 0.6 do when it is released?
   
thanks!
On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote:
   
 Guys, I have a row containing a map

 'id','data', {((1,2)), ((2,3)), ((4,5))}

 What is the expected behavior when I flatten on that bag? I had
   expected
it
 to result in

 'id','data', (1,2)
 'id','data', (2,3)
 'id','data', (4,5)


 But it appears to me that the result of applying FLATTEN to that
 bag
   is
 this instead:

 'id','data', 1,2
 'id','data', 2,3
 'id','data', 4,5


 The latter is returned by the current cloudera's CDH2 and I've
 seen
   the
 prior behavior on other versions of pig.

 Which is the correct behavior by design?

 What will pig 0.6 do when it is released?

 thanks!

   
   
   
   
--
Zaki Rahaman
   
   
  
  
   --
   Zaki Rahaman
  
 



Re: What should FLATTEN do?

2010-04-02 Thread hc busy
Okay guys, some details after some digging. We've got this version of pig
from CDH2 installed:

hadoop-pig-0.5.0+11.1-1


The list of patches that they applied on top of 0.5.0 is here:

http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt

The patches listed there don't seem to deal with FLATTEN in any way.

Any suggestions?




On Fri, Apr 2, 2010 at 1:49 PM, hc busy hc.b...@gmail.com wrote:


  yeah, you have to implement outputSchema() method on the udf in order
 to make the content of the tuple visible... There's a nice example in the
 UDF Manual

 http://hadoop.apache.org/pig/docs/r0.6.0/udf.html

 http://hadoop.apache.org/pig/docs/r0.6.0/udf.htmlsearch for 'package
 myudf' until u find it.



 On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney 
 russell.jur...@gmail.comwrote:

 Not sure if this is exactly the same, but when I've created tuples within
 tuples in UDFs (to preserve order of pairs), from bag input, Pig has
 allowed
 it - but I can't work with that data in subsequent steps.

 On Fri, Apr 2, 2010 at 12:37 PM, hc busy hc.b...@gmail.com wrote:

  Yeah, I'm sure it has nested tuples. Pig doesn't natively support
  introduction of tuples
 
  h = foreach g generate ((x,y,z)), (x), x
 
  doesn't work, but i have a udf that does that don't ask why, and
  I've seen it print double pair of paren's when I took a dump.
 
  Our hadoop guys here says it's CDH2 and that the upgrade was just
  re-installation of CDH2... (same jars) But certainly my script
 suddenly
  started doing weird things when it flattened that all the way through.
 
  I'd support the prior behavior as well, because that seems to match my
  reading of documentation on behavior of FLATTEN.
 
 
 
  Has anybody else had this problem with recent cloudera/pig versions?
 
 
  thnx!!
 
 
  On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com
  wrote:
 
   Stupid question but are you sure your bag has the dual sets of
  parentheses?
   (And if I may ask, why is that the case?)
  
   On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com
   wrote:
  
If I'm not mistaken, the output is the expected behavior. Flatten
  should
unnest bags. I'm assuming your statement is something like FOREACH
 ...
GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the
  first
   two
fields of a tuple for every tuple in the nested bag.
   
   
   
   
On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote:
   
doh s/map/bag/g
   
I seem to get maps and bags mixed up or some reason...
   
Guys, I have a row containing a *bag*
   
'id','data', {((1,2)), ((2,3)), ((4,5))}
   
What is the expected behavior when I flatten on that bag? I had
  expected
it
to result in
   
'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)
   
   
But it appears to me that the result of applying FLATTEN to that
 bag
  is
this
instead:
   
'id','data', 1,2
'id','data', 2,3
'id','data', 4,5
   
   
The latter is returned by the current cloudera's CDH2 and I've seen
  the
prior behavior on other versions of pig.
   
Which is the correct behavior by design?
   
What will pig 0.6 do when it is released?
   
thanks!
On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com
 wrote:
   
 Guys, I have a row containing a map

 'id','data', {((1,2)), ((2,3)), ((4,5))}

 What is the expected behavior when I flatten on that bag? I had
   expected
it
 to result in

 'id','data', (1,2)
 'id','data', (2,3)
 'id','data', (4,5)


 But it appears to me that the result of applying FLATTEN to that
 bag
   is
 this instead:

 'id','data', 1,2
 'id','data', 2,3
 'id','data', 4,5


 The latter is returned by the current cloudera's CDH2 and I've
 seen
   the
 prior behavior on other versions of pig.

 Which is the correct behavior by design?

 What will pig 0.6 do when it is released?

 thanks!

   
   
   
   
--
Zaki Rahaman
   
   
  
  
   --
   Zaki Rahaman
  
 





Re: What should FLATTEN do?

2010-04-02 Thread hc busy
The hadoop version:

hadoop-0.20-0.20.1+169.68-1

On Fri, Apr 2, 2010 at 2:33 PM, hc busy hc.b...@gmail.com wrote:

 Okay guys some details after some digging. We've got this version of  pig
 from CDH2 installed:

 hadoop-pig-0.5.0+11.1-1


 the list of patches that they applied on top of 0.5.0 are listed here:

 http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt

 http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txtThe patches
 listed there doesn't seem to deal with FLATTEN in any way.

 Any suggestions?




 On Fri, Apr 2, 2010 at 1:49 PM, hc busy hc.b...@gmail.com wrote:


  yeah, you have to implement outputSchema() method on the udf in order
 to make the content of the tuple visible... There's a nice example in the
 UDF Manual

 http://hadoop.apache.org/pig/docs/r0.6.0/udf.html

 http://hadoop.apache.org/pig/docs/r0.6.0/udf.htmlsearch for 'package
 myudf' until u find it.



 On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney russell.jur...@gmail.com
  wrote:

 Not sure if this is exactly the same, but when I've created tuples within
 tuples in UDFs (to preserve order of pairs), from bag input, Pig has
 allowed
 it - but I can't work with that data in subsequent steps.

 On Fri, Apr 2, 2010 at 12:37 PM, hc busy hc.b...@gmail.com wrote:

  Yeah, I'm sure it has nested tuples. Pig doesn't natively support
  introduction of tuples
 
  h = foreach g generate ((x,y,z)), (x), x
 
  doesn't work, but i have a udf that does that don't ask why,
 and
  I've seen it print double pair of paren's when I took a dump.
 
  Our hadoop guys here says it's CDH2 and that the upgrade was just
  re-installation of CDH2... (same jars) But certainly my script
 suddenly
  started doing weird things when it flattened that all the way through.
 
  I'd support the prior behavior as well, because that seems to match my
  reading of documentation on behavior of FLATTEN.
 
 
 
  Has anybody else had this problem with recent cloudera/pig versions?
 
 
  thnx!!
 
 
  On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com
  wrote:
 
   Stupid question but are you sure your bag has the dual sets of
  parentheses?
   (And if I may ask, why is that the case?)
  
   On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com
 
   wrote:
  
If I'm not mistaken, the output is the expected behavior. Flatten
  should
unnest bags. I'm assuming your statement is something like FOREACH
 ...
GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the
  first
   two
fields of a tuple for every tuple in the nested bag.
   
   
   
   
On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote:
   
doh s/map/bag/g
   
I seem to get maps and bags mixed up or some reason...
   
Guys, I have a row containing a *bag*
   
'id','data', {((1,2)), ((2,3)), ((4,5))}
   
What is the expected behavior when I flatten on that bag? I had
  expected
it
to result in
   
'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)
   
   
But it appears to me that the result of applying FLATTEN to that
 bag
  is
this
instead:
   
'id','data', 1,2
'id','data', 2,3
'id','data', 4,5
   
   
The latter is returned by the current cloudera's CDH2 and I've
 seen
  the
prior behavior on other versions of pig.
   
Which is the correct behavior by design?
   
What will pig 0.6 do when it is released?
   
thanks!
On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com
 wrote:
   
 Guys, I have a row containing a map

 'id','data', {((1,2)), ((2,3)), ((4,5))}

 What is the expected behavior when I flatten on that bag? I had
   expected
it
 to result in

 'id','data', (1,2)
 'id','data', (2,3)
 'id','data', (4,5)


 But it appears to me that the result of applying FLATTEN to that
 bag
   is
 this instead:

 'id','data', 1,2
 'id','data', 2,3
 'id','data', 4,5


 The latter is returned by the current cloudera's CDH2 and I've
 seen
   the
 prior behavior on other versions of pig.

 Which is the correct behavior by design?

 What will pig 0.6 do when it is released?

 thanks!

   
   
   
   
--
Zaki Rahaman
   
   
  
  
   --
   Zaki Rahaman
  
 






download link broken?

2010-03-23 Thread hc busy
the link at

http://hadoop.apache.org/pig/releases.html

Download a release now! http://www.apache.org/dyn/closer.cgi/hadoop/pig

links to a non-existent release.

thnx


Operating on Cogroups and Iterations in Pig Re: more bagging fun

2010-03-12 Thread hc busy
Hmm, okay, I read the documentation further and it appears that this has
already been discussed previously (here:
http://wiki.apache.org/pig/PigTypesFunctionalSpec). There seems to be a
question of what's the right thing to do. It seems clear to me though. When an
operation like '*' is applied, this is clearly an item-wise operation that is
to be applied to each member of the bag. If a function is an aggregate (SUM),
then it operates across an entire bag.

When a COGROUP occurs, just do what SQL does. Which is to say, perform a cross
join if an aggregate has been applied across several bags. And do so
automatically, so we don't have to type out the separate FLATTENs:

grouped = COGROUP employee BY name, bonuses BY name;
flattened = FOREACH grouped GENERATE group, FLATTEN(employee),
FLATTEN(bonuses);
grouped_again = GROUP flattened BY group;
total_compensation = FOREACH grouped_again GENERATE group,
SUM(employee::salary * bonuses::multiplier);

So this should do the same:

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
SUM(employee::salary * bonuses::multiplier);


automatically, because that can only have one meaning.
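
As a rough illustration of the meaning being proposed, here is a small Python model (not Pig Latin): cogroup the two relations by name, cross the two bags within each group, multiply item-wise, and sum the products. The function names and the sample records are invented for the example; only the employee/bonuses, salary/multiplier naming comes from the snippets above.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key):
    """Group two lists of dict records by key.
    Returns {key_value: (left_bag, right_bag)}."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec[key]][0].append(rec)
    for rec in right:
        groups[rec[key]][1].append(rec)
    return dict(groups)


def total_compensation(employee, bonuses):
    """Per name: cross the two bags, multiply item-wise, then sum."""
    result = {}
    for name, (emp_bag, bonus_bag) in cogroup(employee, bonuses, "name").items():
        result[name] = sum(e["salary"] * b["multiplier"]
                           for e, b in product(emp_bag, bonus_bag))
    return result


# Hypothetical sample data.
employee = [{"name": "ann", "salary": 100}, {"name": "bob", "salary": 200}]
bonuses = [{"name": "ann", "multiplier": 2}, {"name": "ann", "multiplier": 3},
           {"name": "bob", "multiplier": 1}]
totals = total_compensation(employee, bonuses)
```

With this data ann has two bonus rows, so her group contributes 100*2 + 100*3; that is the cross-join behaviour the FLATTEN/GROUP version spells out by hand.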

Alternatively, if it is desired to stay with a low-level language, the
confusion around UDFs that take bags and UDFs that operate on members of
bags can be resolved if we do two things.

1.) Allow UDFs to actually become first class citizens. This way we can
pass UDFs to other UDFs.
2.) Introduce the concept of map() and reduce() operators over bags.

These two things allow us more freedom and follow the map-reduce paradigm
more closely.

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
reduce(SUM,map(*,employee::salary,bonuses::multiplier));


Actually, this may deserve a separate keyword. Because map and reduce
operate on single bags whereas Pig introduces this concept of co-grouping,
we should have *comap* and *coreduce* that take functions and operate on
multiple bags that are results of a *cogroup*.

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
REDUCE(SUM,COMAP(*, employee::salary,bonuses::multiplier));


This allows us to write efficiently, on one line, what would otherwise be
several aliases and unnecessary FLATTENed cross products.
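
A minimal Python sketch of the idea, assuming functions can be passed as values. The names comap and reduce_bag are stand-ins for the proposed COMAP and REDUCE, and pairing the two bags as a cross product is an assumption carried over from the SQL remark earlier in this message; the proposal itself does not say how items are paired.

```python
from functools import reduce
from itertools import product
import operator


def comap(fn, bag_a, bag_b):
    """Apply a two-argument function to every pairing of items drawn
    from two bags (cross-product pairing is an assumption here)."""
    return [fn(a, b) for a, b in product(bag_a, bag_b)]


def reduce_bag(fn, bag, initial):
    """Fold a bag down to one value with a two-argument function."""
    return reduce(fn, bag, initial)


# Hypothetical bags for one cogroup key.
salaries = [100]
multipliers = [2, 3]

products = comap(operator.mul, salaries, multipliers)
total = reduce_bag(operator.add, products, 0)
```

Because operator.mul and operator.add are ordinary values here, no intermediate alias or explicit flatten step is needed, which is the point being argued.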

A second thing that I see is the recommendation of implementing looping
constructs. I wonder if I may suggest, as a follow up to the above, that we
beef up UDF's as first class citizens and add the ability to create UDF
functions in Pig Latin with the ability to recurse.

The reason why I think this is a better way to loop than *for(;;)*,
*while(){}* and *do{}while()* statements is that recursive calls are
functional and are more easily optimized than imperative code. The
PigJournal (http://wiki.apache.org/pig/PigJournal) has an entry for all of
these constructs and functions under the heading "Extending Pig to Include
Branching, Looping, and Functions", but because the map-reduce paradigm is
inherently functional, I think that staying functional would be
a better way to approach this improvement. So the minimal set of
additional features needed is functions and branching, and we
would have loops as a side effect of those improvements.

In order for the optimizations to be available to the Pig Latin interpreter,
the functions and branching *must* be implemented within the Pig system. If
they are externalized, or implemented as UDL of some other language, then
opportunities for optimization of the execution vanish.


Anyways, a couple of cents on a rainy day.




On Wed, Mar 10, 2010 at 10:15 AM, hc busy hc.b...@gmail.com wrote:

 An additional thought... we can define udf's like

 ADD(bag{(int,int)}), DIVIDE(bag{(int,int)}), MULTIPLY(bag{(int,int)}),
 SQRT(bag{(float)})..

 basically vectorize most of the common arithmetic operations, but then the
 language has to support it by converting

 bag.a + bag.b

 to

 ADD(bag.(a,b))

 I guess there are some difficulties, for instance:

 SQRT(bag.a)+bag.b

 How would this work? because sqrt(bag.a) returns a bag, how would we
 convert it to the correct per tuple operation? It's almost like we want to
 convert an expression

 SUM(SQRT(bag.a),bag.b)

 into a function F such that

 SUM(SQRT(bag.a),bag.b) = F(bag.a,bag.b)

 and then the F is computed by iterating through on each tuple of the bag.

 FOREACH ... GENERATE ..., F(bag.(a,b));






 On Wed, Mar 10, 2010 at 9:31 AM, hc busy hc.b...@gmail.com wrote:


 So, pig team, what is the right way to accomplish this?


 On Tue, Mar 9, 2010 at 10:50 PM, Mridul Muralidharan 
 mrid...@yahoo-inc.com wrote:

 On Tuesday 09 March 2010 04:13 AM, hc busy wrote:

 okay. Here's the bag that I have:

  {group: (a: int,b: chararray,c: chararray,d: int), TABLE: {number1:
 int,
 number2:int}}



 and I want to do this

 grunt  CALCULATE= FOREACH

Re: ERROR 6017: Execution failed, while processing

2010-03-10 Thread hc busy
Okay, just a quick update, I eventually found the actual java error from
hadoop logs, but it was equally confusing. It complains of accessing the 4th
element of a tuple that has only one item. But still, it doesn't say which
line of pig latin introduced that error.

I commented out portions of my large pig script until I found the offending
line... I wish there was an easier way to debug this...
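
The comment-things-out approach can be made systematic with a binary search over script prefixes. Below is a hedged Python sketch, not a Pig tool: runs_ok is a hypothetical callback that would run a trial script and report success, and the search assumes that once a prefix fails, every longer prefix fails too. With real Pig each trial prefix would also need something at the end that forces execution (for example a DUMP of the last alias), since nothing runs until output is requested.

```python
def first_failing_statement(statements, runs_ok):
    """Return the index of the first statement whose inclusion makes the
    script fail, or None if the whole script runs.

    Assumes the empty script passes and that failure is monotone:
    if a prefix fails, every longer prefix fails as well."""
    if runs_ok(statements):
        return None
    lo, hi = 0, len(statements)  # prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if runs_ok(statements[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1


# Toy stand-in for a script: the "run" fails once the bad statement is included.
script = ["load", "group", "bad_projection", "store"]
culprit = first_failing_statement(script, lambda prefix: "bad_projection" not in prefix)
```

For a 500-line script this needs about nine trial runs instead of one per statement.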


On Mon, Mar 8, 2010 at 5:25 PM, hc busy hc.b...@gmail.com wrote:


 Guys, I just ran into a weird exception 500 lines into writing a pig
 script... Below attached is the error. Does anybody have any idea about how
 to debug this? I don't even know which step of my 500 line pig script caused
 this error.

 Any suggestions on how to track down the offending operation?

 Thanks in advance!

 Pig Stack Trace
 ---------------
 ERROR 6017: Execution failed, while processing
 hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091

 org.apache.pig.backend.executionengine.ExecException: ERROR 6017:
 Execution failed, while processing
 hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358,
 hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091
         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:181)
         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:777)
         at org.apache.pig.PigServer.execute(PigServer.java:770)
         at org.apache.pig.PigServer.access$100(PigServer.java:89)
         at org.apache.pig.PigServer$Graph.execute(PigServer.java:947)
         at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
         at org.apache.pig.Main.main(Main.java:320)



ERROR 6017: Execution failed, while processing

2010-03-08 Thread hc busy
Guys, I just ran into a weird exception 500 lines into writing a pig
script... Below attached is the error. Does anybody have any idea about how
to debug this? I don't even know which step of my 500 line pig script caused
this error.

Any suggestions on how to track down the offending operation?

Thanks in advance!

Pig Stack Trace
---------------
ERROR 6017: Execution failed, while processing
hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299,
hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384,
hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628,
hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358,
hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091

org.apache.pig.backend.executionengine.ExecException: ERROR 6017: Execution
failed, while processing
hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299,
hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534,
hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384,
hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628,
hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358,
hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:181)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:777)
        at org.apache.pig.PigServer.execute(PigServer.java:770)
        at org.apache.pig.PigServer.access$100(PigServer.java:89)
        at org.apache.pig.PigServer$Graph.execute(PigServer.java:947)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:320)


native types as value of map type? Re: Complex data types as value in a map function

2010-02-24 Thread hc busy
well... I have this data:


[key#'1', b#'2', c#'3', key2#5]
[key#'2', b#'i', c#'m', key2#6]
[key#'3', b#'j', c#'n', key2#7]
[key#'4', b#'k', c#'o', key2#8]

and I run

A= load 'simple_map.data' as (m:map[]);
A2= FOREACH A generate (int)(m#'key2') as key, m;
dump A2;

returning

(,[ key2#5, b#'2',key#'1', c#'3'])
(,[ key2#6, b#'i',key#'2', c#'m'])
(,[ key2#7, b#'j',key#'3', c#'n'])
(,[ key2#8, b#'k',key#'4', c#'o'])


I'm looking at PIG-613, but I guess the title is misleading. None of the
casting of map values works in 0.5.0. I guess if PIG-613 works as
described, I would be in okay shape, because I would be able to cast again
and again using separate aliases...


PIG-613 is not what I meant for PIG-1016, but it seems to get me the feature I
want.



On Tue, Jan 5, 2010 at 7:00 PM, Guy Bayes fatal.er...@gmail.com wrote:

 thanks Thejas, that thread helped out immensely.

 Also great to see Santhosh remembered that nasty PIG 880 bug with the type
 inference causing an integer overflow, which coincidentally enough I also
 got stung by at one time.

 in the meantime, while I would love to have complex map datatypes,
 certainly
 can be worked around using other methods

 appreciate the prompt response
 Guy


 On Tue, Jan 5, 2010 at 10:38 AM, Thejas Nair te...@yahoo-inc.com wrote:

  This is an issue in PigStorage and is present in recent versions of pig. I.e.,
  you cannot have complex types (bag, tuple, map) as a value in map type, if
  you are using PigStorage.
  See - https://issues.apache.org/jira/browse/PIG-1016
 
  -Thejas
 
 
  On 1/5/10 10:28 AM, Alan Gates ga...@yahoo-inc.com wrote:
 
   It should be supported.  You may need to explicitly cast it to a tuple
   so Pig knows to treat it as a tuple.  Can you send the scripts that
   are giving the error?
  
   Alan.
  
   On Jan 4, 2010, at 9:10 PM, Guy Bayes wrote:
  
   Is this supported?
  
   Say I have a map
  
   [f2#(1,6)]
  
   I cannot figure out how to de-reference the (1,6) tuple, I either
   get type
   conversion failure and  () returned, or a 1066 error message ERROR
   1066:
   Unable to open iterator for alias
  
   thanks
   Guy
  
 
 


 --
 you may be acquainted with the night
 but i have seen the darkness in the day
 and you must know it is a terrifying sight...



[jira] Resolved: (PIG-1082) Modify Comparator to work with a typed textual Storage

2010-02-24 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy resolved PIG-1082.
--

Resolution: Fixed

 Modify Comparator to work with a typed textual Storage
 --

 Key: PIG-1082
 URL: https://issues.apache.org/jira/browse/PIG-1082
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.4.0
Reporter: hc busy
   Original Estimate: 5h
  Remaining Estimate: 5h

 See parent bug. This ticket is for just the comparator change, which needs to 
 be made in order for the nested data structures to sort right




[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage

2010-02-24 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1082:
-

Attachment: (was: PIG-1082.patch)

 Modify Comparator to work with a typed textual Storage
 --

 Key: PIG-1082
 URL: https://issues.apache.org/jira/browse/PIG-1082
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.4.0
Reporter: hc busy
   Original Estimate: 5h
  Remaining Estimate: 5h

 See parent bug. This ticket is for just the comparator change, which needs to 
 be made in order for the nested data structures to sort right

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1083) Build separate Storage to read in hiearchical data

2010-01-05 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1083:
-

Summary: Build separate Storage to read in hiearchical data  (was: Build 
separator Storage to read in hiearchical data)

 Build separate Storage to read in hiearchical data
 --

 Key: PIG-1083
 URL: https://issues.apache.org/jira/browse/PIG-1083
 Project: Pig
  Issue Type: Sub-task
Reporter: hc busy
   Original Estimate: 5h
  Remaining Estimate: 5h

 See parent ticket

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2010-01-05 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796893#action_12796893
 ] 

hc busy commented on PIG-1016:
--

Hi Thejas, Olga, and the rest, that sounds about right. I think PIG-1082 is ready 
from my previous effort, and PIG-1083 still needs to be done. Perhaps it 
would make more sense to use Avro or some other binary format instead.

I still have an ASCII nested data structure to read in, but it's not very HP. 
Not sure if anybody needs it any more.

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.7.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-11-10 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Status: Open  (was: Patch Available)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.5.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1082) Modify Comparator to work with a typed textual Storage

2009-11-10 Thread hc busy (JIRA)
Modify Comparator to work with a typed textual Storage
--

 Key: PIG-1082
 URL: https://issues.apache.org/jira/browse/PIG-1082
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.0.0


See parent bug. This ticket is for just the comparator change, which needs to 
be made in order for the nested data structures to sort right

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1083) Build separator Storage to read in hiearchical data

2009-11-10 Thread hc busy (JIRA)
Build separator Storage to read in hiearchical data
---

 Key: PIG-1083
 URL: https://issues.apache.org/jira/browse/PIG-1083
 Project: Pig
  Issue Type: Sub-task
Reporter: hc busy


See parent ticket

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage

2009-11-10 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1082:
-

Attachment: PIG-1082.patch

changes only the comparator

 Modify Comparator to work with a typed textual Storage
 --

 Key: PIG-1082
 URL: https://issues.apache.org/jira/browse/PIG-1082
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.0.0

 Attachments: PIG-1082.patch

   Original Estimate: 5h
  Remaining Estimate: 5h

 See parent bug. This ticket is for just the comparator change, which needs to 
 be made in order for the nested data structures to sort right

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-29 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771308#action_12771308
 ] 

hc busy commented on PIG-1016:
--

Well, I'd like to start by thanking everyone for the attention and support! As 
a first time contributor, I feel my heart warmed by the encouraging comments 
and serious time everyone is spending on my problem. I also greatly appreciate 
the patience everybody has, and of course I am perpetually grateful for 
everybody's work in making this all work.


Line by line, 
{code}
+// find bug is complaining about nulls. This check sequence will prevent nulls from being dereferenced.
+if(o1!=null && o2!=null){
...
+}else{
+  if(o1==null && o2==null){rc=0;}
+  else if(o1==null) {rc=-1;}
+  else{ rc=1; }
{code}

This does what it says: it prevents a FindBugs warning. By convention, non-null 
is greater than null.

{code}
+// In case the objects are comparable
+if((o1 instanceof NullableBytesWritable && o2 instanceof NullableBytesWritable)||
+   !(o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable)
+){
+
+  NullableBytesWritable nbw1 = (NullableBytesWritable)o1;
+  NullableBytesWritable nbw2 = (NullableBytesWritable)o2;
+  
+  // If either are null, handle differently.
+  if (!nbw1.isNull() && !nbw2.isNull()) {
+  rc = ((DataByteArray)nbw1.getValueAsPigType()).compareTo((DataByteArray)nbw2.getValueAsPigType());
+  } else {
+  // For sorting purposes two nulls are equal.
+  if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+  else if (nbw1.isNull()) rc = -1;
+  else rc = 1;
+  }
+}
{code}


The if statement takes us outside the original comparison code (enclosed in 
the outer if above) ONLY if both comparands are PigNullableWritable objects 
that are not NullableBytesWritable. This may seem confusing at first glance, 
but it does exactly what the code did before the patch, except for the new 
case that I introduced by allowing other types.

The code is awkward, as Santhosh noted, and I am not sure I fully understand 
the original implementation. But this way we certainly preserve the original 
behavior, and the new cases that this patch introduces are handled in the 
remaining else:

{code}
else{
+  // enter here only if both o1 and o2 are non-NullableByteWritable PigNullableWritable's
+  PigNullableWritable nbw1 = (PigNullableWritable)o1;
+  PigNullableWritable nbw2 = (PigNullableWritable)o2;
+  // If either are null, handle differently.
+  if (!nbw1.isNull() && !nbw2.isNull()) {
+  rc = nbw1.compareTo(nbw2);
+  } else {
+  // For sorting purposes two nulls are equal.
+  if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+  else if (nbw1.isNull()) rc = -1;
+  else rc = 1;
+  }
+}
{code}


This is the safest way I can think of to write this code, and I have been able 
to order by a value taken out of a map. Joining and then sorting keyed on 
map values both work as well. 


I guess something that flows better might be the following:

{code}
if(o1!=null && o2!=null){
 
if(o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable){
  PigNullableWritable nbw1 = (PigNullableWritable)o1;
  PigNullableWritable nbw2 = (PigNullableWritable)o2;
  // If either are null, handle differently.
  if (!nbw1.isNull() && !nbw2.isNull()) {
  rc = nbw1.compareTo(nbw2);
  } else {
  // For sorting purposes two nulls are equal.
  if (nbw1.isNull() && nbw2.isNull()) rc = 0;
  else if (nbw1.isNull()) rc = -1;
  else rc = 1;
  }
}else{
  throw new Exception("bad compare");
}
}else{
  if(o1==null && o2==null){rc=0;}
  else if(o1==null) {rc=-1;}
  else{ rc=1; }
}
{code}

But I must admit that I don't know what the right thing to do is. I don't know 
the design well enough to know whether throwing an exception is appropriate, 
or whether something else is. And would the last code block perform the right 
comparison in place of the original function?
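
For readers who want to try the ordering rule in isolation, here is a minimal, 
self-contained sketch. It is not the patch and it does not use Pig's classes: 
Wrapped is a hypothetical stand-in for a nullable wrapper, and throwing 
IllegalArgumentException for unsupported types is only one possible policy, 
which the question above leaves open. It shows null sorting before non-null at 
both the reference level and the wrapped-value level.

```java
class NullOrderSketch {
    /** Hypothetical stand-in for a nullable wrapper; not Pig's PigNullableWritable. */
    static final class Wrapped implements Comparable<Wrapped> {
        private final Integer value; // a null value models isNull() == true

        Wrapped(Integer value) { this.value = value; }

        boolean isNull() { return value == null; }

        @Override
        public int compareTo(Wrapped other) { return value.compareTo(other.value); }
    }

    /** Orders null before non-null at both levels, then delegates to compareTo. */
    static int compare(Object o1, Object o2) {
        if (o1 == null || o2 == null) {
            return nullOrder(o1 == null, o2 == null);
        }
        if (!(o1 instanceof Wrapped) || !(o2 instanceof Wrapped)) {
            // Assumed policy for unsupported types; the thread does not settle this.
            throw new IllegalArgumentException("bad compare");
        }
        Wrapped w1 = (Wrapped) o1;
        Wrapped w2 = (Wrapped) o2;
        if (w1.isNull() || w2.isNull()) {
            return nullOrder(w1.isNull(), w2.isNull());
        }
        return w1.compareTo(w2);
    }

    /** Call only when at least one side is null: two nulls tie, a lone null sorts first. */
    private static int nullOrder(boolean firstNull, boolean secondNull) {
        if (firstNull && secondNull) return 0;
        return firstNull ? -1 : 1;
    }
}
```

Pulling the null handling into one helper removes the duplicated three-way 
branch that appears twice in the blocks above.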


Let me know your thoughts on improvements to the patch.




 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.5.0



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-29 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771571#action_12771571
 ] 

hc busy commented on PIG-1016:
--

Thejas, great point! 

Run-time type detection does cost more time at run time and requires more 
discipline to use. 

But I'd like to point out that the original implementation seemed to have 
allowed for this in PigStorage. The change to reduce the types that can be 
stored in the value of a map seems to reduce functionality of Pig. 

I guess the one case where I want to use a map is when I have a sparse tuple 
and don't want to type in a type for each of the many fields. If I went to 
that trouble, I'd just write Java code, or use something where the schema 
is statically defined and stored. 

Say, as a simple example, a self join of one row: 

{{\[data1#\[score#15l,unique_id#100\],data2#\[score#15,foreign#00100\]\]}} 

{code} 
B = join A by m#data1#unique_id, A by m#data2#foreign 
C = Filter B by $0#score==$1#score 
{code} 

I'd think something like this should work without me typing in the entire type 
structure. 


Also, what happens when BinStorage returns a map with a value that isn't a 
bytearray? Does the comparison fail? 


 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.5.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-28 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Fix Version/s: (was: 0.4.0)
   0.5.0
   Status: Patch Available  (was: Open)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.5.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-28 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771170#action_12771170
 ] 

hc busy commented on PIG-1016:
--

Okay, I'm trying to get this into a release of Pig... I noticed 0.4 came out, 
but nothing has happened on this ticket.

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.5.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-19 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: PIG-1016.patch)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy

 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-19 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

Same patch as before, but the hash seems different. Maybe I submitted the wrong 
patch previously.

d337d3264bf5e6e925515ceff90718e10

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-19 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767387#action_12767387
 ] 

hc busy commented on PIG-1016:
--

%...@#$, had me sweating for a while..., as mentioned previously, this is 
functionality that I'd like to use... not just fun weekend project... hehe..

thnx.

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-17 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: PIG-1016.patch)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy

 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-17 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

Re-attaching patch. It seems my previous patch didn't pass _any_ unit tests. 

Ouch! Anyway, I ran a few unit tests, and they still pass on my machine. I've 
been accused of having crap on my machine that makes programs pass their unit 
tests. Hopefully those accusations were false, and when a unit test passes 
on my machine, it passes on the build machines too.

4b425...904b2

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-16 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766710#action_12766710
 ] 

hc busy commented on PIG-1016:
--

'kay, since my last comment, I've verified that in trunk, the patch in this 
ticket did not introduce an error. The skewed join (correct or not) 
returns the same number of rows when the data is read in from a nested data 
structure as when it is read in from a tuple.

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1016) Reading in map data seems broken

2009-10-15 Thread hc busy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766202#action_12766202
 ] 

hc busy commented on PIG-1016:
--

I skimmed PIG-880. Here is a simplified version of what I might need to do:


bash% cat map.dat 
[a#2,b#'d',c#(1,2,3)]
[a#1,b#'a',c#(1,2,3)]
[a#3,b#'c',c#(1,2,3)]
bash% PIG
grunt> A= load 'map.dat' as (data:map[]);
grunt> B= foreach A generate (int)(data#'a'), (chararray)(data#'b'),(tuple())(data#'c');
grunt> C= order B by $0;
grunt> dump C;
(1,'a',(1,2,3))
(2,'d',(1,2,3))
(3,'c',(1,2,3))
grunt> D= order B by $1;
grunt> dump D;
(1,'a',(1,2,3))
(3,'c',(1,2,3))
(2,'d',(1,2,3))
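
The same idea, casting an untyped map value to a concrete type and then 
ordering by it, can be sketched in plain Java. This is only an illustration of 
the session above, not Pig code: OrderByMapValue, sampleRows, and orderBy are 
hypothetical names, and each row is modelled as a Map<String, Object>. A value 
of the wrong type makes Class.cast throw a ClassCastException, which roughly 
mirrors a failed cast.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrderByMapValue {
    /** Builds the three sample rows from map.dat as untyped maps (tuple column omitted). */
    static List<Map<String, Object>> sampleRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(2, "d"));
        rows.add(row(1, "a"));
        rows.add(row(3, "c"));
        return rows;
    }

    private static Map<String, Object> row(int a, String b) {
        Map<String, Object> m = new HashMap<>();
        m.put("a", a);
        m.put("b", b);
        return m;
    }

    /** Sorts a copy of the rows by the value under key, cast to the requested type. */
    static <T extends Comparable<T>> List<Map<String, Object>> orderBy(
            List<Map<String, Object>> rows, String key, Class<T> type) {
        List<Map<String, Object>> sorted = new ArrayList<>(rows);
        // Missing keys yield null, which sorts first, matching the null-first convention.
        Comparator<T> nullsFirst = Comparator.nullsFirst(Comparator.<T>naturalOrder());
        sorted.sort(Comparator.comparing(
                (Map<String, Object> r) -> type.cast(r.get(key)), nullsFirst));
        return sorted;
    }
}
```

Calling orderBy(sampleRows(), "a", Integer.class) corresponds to ordering by 
$0, and orderBy(sampleRows(), "b", String.class) to ordering by $1.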

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-15 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: PIG-1016.patch)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy

 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-15 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

Submitting a patch to work around both PIG-880 and PIG-1016

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-13 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: PIG-1016.patch)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy

 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-13 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Status: Open  (was: Patch Available)

Didn't pass a few other affected unit tests

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-13 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Status: Patch Available  (was: Open)

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-13 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

Sorry, first-time contributor. This submission includes the fix and also fixes 
several unit tests that failed.

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as value as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)
Reading in map data seems broken


 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy


Hi, I'm trying to load a map that has a tuple for value. The read fails in 
0.4.0 because of a misconfiguration in the parser, whereas almost all 
documentation states that the value of a map can be any type.

I've attached a patch that allows us to read in complex objects as value as 
documented. I've done simple verification of loading in maps with tuple/map 
values and writing them back out using LOAD and STORE. All seems to work fine.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Status: Patch Available  (was: Open)

% diff org/apache/pig/data/parser/TextDataParser.jjt org/apache/pig/data/parser/newTextDataParser.jjt
145c145
<   String value = null;
---
>   Object value = null;
149c149
<   (key = StringDatum() "#" value = StringDatum())
---
>   (key = StringDatum() "#" value = Datum())
151c151
<   keyValues.put(key, new DataByteArray(value.getBytes("UTF-8")));
---
>   keyValues.put(key, value);
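
To illustrate what the change above is after, here is a toy recursive-descent 
reader in which a map value may be any datum (a scalar, a tuple, or another 
map) rather than only a string. It is a sketch under stated assumptions, not 
Pig's TextDataParser: MapValueSketch and its methods are hypothetical, scalars 
are kept as plain strings, and quoting and escaping are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy reader for text such as [a#2,c#(1,2,3)]; not Pig's TextDataParser. */
class MapValueSketch {
    private final String s;
    private int pos = 0;

    private MapValueSketch(String s) { this.s = s; }

    static Object parse(String text) {
        MapValueSketch p = new MapValueSketch(text);
        Object result = p.datum();
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return result;
    }

    // A datum is a map, a tuple, or a plain scalar kept as a string.
    private Object datum() {
        char c = peek();
        if (c == '[') return map();
        if (c == '(') return tuple();
        return scalar();
    }

    private Map<String, Object> map() {
        Map<String, Object> m = new LinkedHashMap<>();
        expect('[');
        while (peek() != ']') {
            String key = scalar();
            expect('#');
            m.put(key, datum()); // the value may be any datum, not only a string
            if (peek() == ',') pos++;
        }
        pos++; // consume ']'
        return m;
    }

    private List<Object> tuple() {
        List<Object> t = new ArrayList<>();
        expect('(');
        while (peek() != ')') {
            t.add(datum());
            if (peek() == ',') pos++;
        }
        pos++; // consume ')'
        return t;
    }

    private String scalar() {
        int start = pos;
        while (pos < s.length() && "[](),#".indexOf(s.charAt(pos)) < 0) pos++;
        if (pos == start) throw new IllegalArgumentException("empty scalar at " + pos);
        return s.substring(start, pos);
    }

    private char peek() {
        if (pos >= s.length()) throw new IllegalArgumentException("unexpected end of input");
        return s.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c) throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        pos++;
    }
}
```

The only line that matters for this ticket is m.put(key, datum()): calling 
scalar() there instead would reproduce the old string-only behavior.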





[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: map_to_any_value.patch

A patch for org/apache/pig/data/parser/TextDataParser.jjt




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: map_to_any_value.patch)




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: trunk_map_to_any_value.patch

Including a patch via svn diff.




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

rename




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: trunk_map_to_any_value.patch)




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Status: Open  (was: Patch Available)




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: PIG-1016.patch)




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

This patch was generated with svn diff and includes a unit test.




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: (was: PIG-1016.patch)




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-10-12 Thread hc busy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hc busy updated PIG-1016:
-

Attachment: PIG-1016.patch

Unit test plus patch. This time the unit test actually passes.
