[jira] Created: (PIG-1465) Filter inside foreach is broken
Filter inside foreach is broken
---
Key: PIG-1465
URL: https://issues.apache.org/jira/browse/PIG-1465
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: hc busy

{quote}
% cat data.txt
x,a,1,a
x,a,2,a
x,a,3,b
x,a,4,b
y,a,1,a
y,a,2,a
y,a,3,b
y,a,4,b

% cat script.pig
a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
b = group a by ind;
describe b;
f = foreach b {
  all_total = SUM(a.num);
  fed = filter a by (f1 == f2);
  some_total = (int)SUM(fed.num);
  generate group as ind, all_total, some_total;
}
describe f;
dump f;

% pig -f script.pig
(x,a,1,a,,)
(x,a,2,a,,)
(x,a,3,b,,)
(x,a,4,b,,)
(y,a,1,a,,)
(y,a,2,a,,)
(y,a,3,b,,)
(y,a,4,b,,)

% cat what_I_expected
(x,10,3)
(y,10,3)
{quote}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1465) Filter inside foreach is broken
[ https://issues.apache.org/jira/browse/PIG-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1465:

Description: (same script, observed output, and expected output as above, with the braces of the nested foreach escaped as \{ and \})
Re: does EvalFunc generate the entire bag always ?
Well, see, that's the thing: the 'sort A by $0' is already n*lg(n)... ahh, I see, my own example suffers from this problem. I guess I'm wondering how 'limit' works in conjunction with UDFs... A practical application escapes me right now, but if I do

C = foreach B {
  C1 = MyUdf(B.bag_on_b);
  C2 = limit C1 5;
}

does it know to push the limit in this case?

On Thu, May 27, 2010 at 2:32 PM, Alan Gates ga...@yahoo-inc.com wrote:

The default case is that UDFs that take bags (such as COUNT, etc.) are handed the entire bag at once. In the case where all UDFs in a foreach implement the Algebraic interface and the expression itself is algebraic, the combiner will be used, thus significantly limiting the size of the bag handed to the UDF.

The Accumulator does hand records to the UDF a few thousand at a time. Currently it has no way to turn off the flow of records.

What you want might be accomplished by the LIMIT operator, which can be used inside a nested foreach. Something like:

C = foreach B {
  C1 = sort A by $0;
  C2 = limit C1 5;
  generate myUDF(C2);
}

Alan.

On May 26, 2010, at 11:59 AM, hc busy wrote:

Hey guys, how are bags passed to EvalFunc stored? I was looking at the Accumulator interface, and it says that the reason it is needed for COUNT and SUM is that EvalFunc always gives you the entire bag when the EvalFunc is run on a bag. I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), the code inside that does

for (Tuple entry : inputDataBag) { /* stuff */ }

was an actual iterator that iterated over the bag sequentially, without necessarily having the entire bag in memory all at once. Because it's an iterator, there's no way to do anything other than stream through it.

I'm looking at this because Accumulator has no way of telling Pig "I've seen enough"; it streams through the entire bag no matter what happens. (Hypothetically speaking, if I were writing a "5th item of a sorted bag" UDF, after I see the 5th item of a 5-million-entry bag, I want to stop executing if possible.) Is there an easy way to make this happen?
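Alan's description of the Accumulator contract (records delivered a batch at a time, with no way to turn off the flow) can be modelled with a small self-contained sketch. The interface below is a simplified stand-in for org.apache.pig.Accumulator, not the real signature; class and method shapes are illustrative only:

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorSketch {
    // Simplified stand-in for org.apache.pig.Accumulator<T> (not the real
    // API): the runtime hands records to accumulate() a batch at a time,
    // then calls getValue() once at the end.
    interface Accumulator<T> {
        void accumulate(List<Long> batch);
        T getValue();
    }

    // A SUM-style accumulator: only one batch is ever in memory at once,
    // but note there is no way to say "stop sending me batches".
    static class SumAccumulator implements Accumulator<Long> {
        private long sum = 0;

        @Override
        public void accumulate(List<Long> batch) {
            for (Long v : batch) {
                sum += v;
            }
        }

        @Override
        public Long getValue() {
            return sum;
        }
    }

    public static void main(String[] args) {
        SumAccumulator acc = new SumAccumulator();
        // The runtime would deliver a 5-element bag as, say, two batches:
        acc.accumulate(Arrays.asList(1L, 2L, 3L));
        acc.accumulate(Arrays.asList(4L, 5L));
        System.out.println(acc.getValue()); // prints 15
    }
}
```

This makes the memory trade-off concrete: the accumulator sees every record eventually, so an "early exit after the 5th item" still pays for the full stream, which is exactly the limitation raised in the thread.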
does EvalFunc generate the entire bag always ?
Hey guys, how are bags passed to EvalFunc stored? I was looking at the Accumulator interface, and it says that the reason it is needed for COUNT and SUM is that EvalFunc always gives you the entire bag when the EvalFunc is run on a bag. I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), the code inside that does

for (Tuple entry : inputDataBag) { /* stuff */ }

was an actual iterator that iterated over the bag sequentially, without necessarily having the entire bag in memory all at once. Because it's an iterator, there's no way to do anything other than stream through it.

I'm looking at this because Accumulator has no way of telling Pig "I've seen enough"; it streams through the entire bag no matter what happens. (Hypothetically speaking, if I were writing a "5th item of a sorted bag" UDF, after I see the 5th item of a 5-million-entry bag, I want to stop executing if possible.) Is there an easy way to make this happen?
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869720#action_12869720 ] hc busy commented on PIG-1150:

Similarly, there's some code here on numerically stable and distributed calculation: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

I mean, while we're at it, why not calculate all central moments? {code}centralMoments(x, y){code} returns the central moments of x up to y; {code}centralMoments(x, 3){code} will return a tuple containing (mean, variance, skew).

VAR() Variance UDF
--
Key: PIG-1150
URL: https://issues.apache.org/jira/browse/PIG-1150
Project: Pig
Issue Type: New Feature
Affects Versions: 0.5.0
Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
Fix For: 0.8.0
Attachments: var.patch

I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum, and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives standard deviation, which is missing from Pig.
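The parallel algorithm linked above merges per-partition (count, mean, M2) partials, where M2 is the sum of squared deviations from the partition mean. Here is a minimal, self-contained Java sketch of that combine step; class and method names are illustrative and not taken from var.patch:

```java
public class ParallelVariance {
    // Partial aggregate for one partition: count, mean, and M2
    // (sum of squared deviations from the partition mean).
    static final class Partial {
        final long n;
        final double mean;
        final double m2;
        Partial(long n, double mean, double m2) {
            this.n = n; this.mean = mean; this.m2 = m2;
        }
    }

    // Build a partial with Welford's online update (numerically stable).
    static Partial of(double[] xs) {
        long n = 0;
        double mean = 0, m2 = 0;
        for (double x : xs) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return new Partial(n, mean, m2);
    }

    // Merge two partials; this is the step the combiner/reducer would run.
    static Partial combine(Partial a, Partial b) {
        long n = a.n + b.n;
        double delta = b.mean - a.mean;
        double mean = a.mean + delta * b.n / n;
        double m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / (double) n;
        return new Partial(n, mean, m2);
    }

    // Population variance; use m2 / (n - 1) for the sample variance.
    static double variance(Partial p) {
        return p.m2 / p.n;
    }

    public static void main(String[] args) {
        Partial left  = of(new double[]{1, 2});
        Partial right = of(new double[]{3, 4});
        Partial all   = combine(left, right);
        System.out.println(variance(all)); // prints 1.25
    }
}
```

Unlike the count/sum/sum-of-squares formulation, the M2 form avoids the catastrophic cancellation the Wikipedia page warns about, while still combining associatively across partitions.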
need help again, what causes Cannot cast to Unknown ?
Hey guys, I managed to generate another horrendous error message (before the plan completes). What typically causes this error to happen? The script survives through all describes (I can describe after all assignments to aliases), but it still produces this error. (Running Pig 0.5 on Hadoop 0.20.)

2010-05-03 22:54:22,054 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1051: Cannot cast to Unknown
2010-05-03 22:54:22,054 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
at org.apache.pig.PigServer.compileLp(PigServer.java:818)
at org.apache.pig.PigServer.compileLp(PigServer.java:789)
at org.apache.pig.PigServer.execute(PigServer.java:758)
at org.apache.pig.PigServer.access$100(PigServer.java:89)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:947)
at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:320)
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve Join output schema
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2360)
at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:201)
at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
... 14 more
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForJoinInnerPlan(TypeCheckingVisitor.java:2544)
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2348)
... 19 more
Re: how to compare?
I'm not sure. If the types of the two things I am comparing (typically the same field of tuples inside a bag) differ, I expect it to throw an error instead of ordering the results by data type. Because if it doesn't, it will either error out later in the Pig script, or the offending field will be serialized out and some downstream program will read it and crash. I'd prefer it fail early rather than late. Which is why I'm just casting to Comparable and calling compareTo. The problem with that is that it depends on each Comparable's compareTo method handling errors in similar ways, and I see that it does, by calling into DataType.compare (circa l166 in DataByteArray for BYTEARRAYs...). Ahh, I see, so by casting to Comparable it does the same as DataType.compare when the types are different. Hmm, I guess I want to stick with casting to Comparables, since the two ways of calling them are identical. Unless people have other comments.

On Wed, Apr 28, 2010 at 3:57 AM, Gianmarco gianmarco@gmail.com wrote:

Basically, DataType.compare() just calls the compareTo() method of the two objects after checking that the two types are the same. However, DataType.compare() does two things more than a simple compareTo(). First, it is specialized for Maps, for which sizes are taken into account and keys are sorted. Second, it imposes an (arbitrary) order on different data types. In this way the types are not dependent on each other and there is a single point of control. So I think you should use DataType.compare() unless you are sure you do not need these features.

Anyway, there is something that I do not understand: why does the function need to switch on the datatype byte and cast the objects before calling compareTo on them? Just casting them to Comparable and letting Java run the proper polymorphic method should work as well, right?

On Wed, Apr 28, 2010 at 07:12, hc busy hc.b...@gmail.com wrote:

Guys, I'm implementing that ExtremalTupleByNthField and I have a question about comparison... So, when I have parsed out the two objects that I want to compare, how do I perform that comparison? My current implementation assumes the data is Comparable (which it invariably is within Pig), so I do

int c = ((Comparable)o1).compareTo((Comparable)o2);

Now I also see that there's another compare that compares the two objects by:

int c = DataType.compare(o1, o2, DataType.findType(o1), DataType.findType(o2));

The initial method works for all types I've tried (int, string, etc.), but the latter is used by another UDF already in SVN. What are your suggestions? (PIG-1386 is the ticket where I've checked in the patch.)
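The "fail early on mixed types" behaviour discussed above can be sketched in plain Java. This helper is illustrative only (it is not Pig's DataType.compare, which instead imposes an order across types): it rejects a type mismatch up front and otherwise defers to compareTo, exactly the cast-to-Comparable approach from the thread.

```java
public class StrictCompare {
    // Illustrative helper: throw on a runtime type mismatch instead of
    // silently ordering by data type, then defer to compareTo().
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compare(Object o1, Object o2) {
        if (o1.getClass() != o2.getClass()) {
            throw new IllegalArgumentException(
                "type mismatch: " + o1.getClass().getName()
                + " vs " + o2.getClass().getName());
        }
        // Both objects share a class, so the raw cast is safe at runtime.
        return ((Comparable) o1).compareTo(o2);
    }

    public static void main(String[] args) {
        System.out.println(compare(3, 5) < 0);     // prints true
        System.out.println(compare("a", "b") < 0); // prints true
        try {
            compare(3, "3"); // mixed Integer/String: throws
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The design trade-off is the one Gianmarco describes: this fails fast, whereas DataType.compare deliberately tolerates mixed types by giving each type a fixed rank, which is what a total ordering for ORDER BY requires.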
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386:

Status: Open (was: Patch Available)

UDF to extend functionalities of MaxTupleBy1stField
---
Key: PIG-1386
URL: https://issues.apache.org/jira/browse/PIG-1386
Project: Pig
Issue Type: New Feature
Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
Attachments: PIG-1386-trunk.patch

Based on this conversation:

totally, go for it, it'd be pretty straightforward to add this functionality.

On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:

Hey, while we're on the subject, and I have your attention, can we refactor the UDF MaxTupleByFirstField to take a constructor?

define customMaxTuple ExtremalTupleByNthField(n, 'min');
G = group T by id;
M = foreach G generate customMaxTuple(T);

Where n is the nth field, and the second parameter allows us to specify min, max, median, etc. Does this seem like something useful to everyone?

On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:

What about making them part of the language using symbols? Instead of

foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;

have language support for

foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;

or even:

foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;

Is there a reason not to do the second or third, other than being more complicated? Certainly I'd volunteer to put the top implementation into the util package and submit them as builtins, but the latter syntactic candies seem more natural...

On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:

The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for, and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well.

Alan.

On Apr 19, 2010, at 12:53 PM, hc busy wrote:

Sometimes I wonder... I mean, somebody went to the trouble of making a package called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check any Java code into that package. Any comment about where to put this kind of utility class?

On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:

2010/4/19 hc busy hc.b...@gmail.com

That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDFs in piggybank for these:

toBag();
toTuple(); -- which is kinda like exec(Tuple in){ return in; }
TupleToBag(); -- sometimes you need it this way for some reason.

Ok. I place my current code here; maybe later I will make a patch (if such an implementation is acceptable, of course).

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

import java.io.IOException;

/**
 * Convert any sequence of fields to a bag of tuples with the specified
 * count of fields.<br>
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: count=2, then { (fld1, fld2), (fld3, fld4) ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL
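The core chunking logic of the ToBag sketch above, minus the Pig dependencies, can be sanity-checked in plain Java. Names here are illustrative; the count parameter is passed separately instead of as field 0:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSketch {
    // Plain-Java version of ToBag's loop: split a flat field list into
    // consecutive tuples of `count` elements each.
    static List<List<Object>> chunk(int count, List<Object> fields) {
        List<List<Object>> bag = new ArrayList<>();
        List<Object> tuple = null;
        for (int i = 0; i < fields.size(); i++) {
            if (i % count == 0) {       // start a new tuple every `count` fields
                tuple = new ArrayList<>();
                bag.add(tuple);
            }
            tuple.add(fields.get(i));
        }
        return bag;
    }

    public static void main(String[] args) {
        List<Object> fields = Arrays.<Object>asList("f1", "f2", "f3", "f4");
        System.out.println(chunk(2, fields)); // prints [[f1, f2], [f3, f4]]
    }
}
```

One behaviour worth noting from the loop shape: when the field count is not a multiple of `count`, the final tuple is simply shorter, which matches what the ToBag code above would produce.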
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386:

Status: Patch Available (was: Open)
Fix Version/s: 0.8.0
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386:

Status: Open (was: Patch Available)
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386:

Attachment: (was: PIG-1386-trunk.patch)
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch e503949c4f5f2667657ee02872aff5ce Additional documentation and examples. UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Fix For: 0.8.0 Attachments: PIG-1386-trunk.patch Based on this conversation: totally, go for it, it'd be pretty straightforward to add this functionality. - Hide quoted text - On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote: Hey, while we're on the subject, and I have your attention, can we re-factor the UDF MaxTupleByFirstField to take constructor? *define customMaxTuple ExtremalTupleByNthField(n, 'min');* *G = group T by id;* *M = foreach T generate customMaxTuple(T); * Where n is the nth field, and the second parameter allows us to specify min, max, median, etc... Does this seem like something useful to everyone? On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. 
However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for, and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan.

On Apr 19, 2010, at 12:53 PM, hc busy wrote: Sometimes I wonder... I mean, somebody went to the trouble of making a package called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check in any Java code into that package. Any comment about where to put this kind of utility class?

On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com: That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDFs in piggybank for these: toBag(); toTuple(); -- which is kinda like exec(Tuple in){return in;} TupleToBag(); -- sometimes you need it this way for some reason.

OK. I place my current code here; maybe later I'll make a patch (if such an implementation is acceptable, of course).

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Convert any sequence of fields to a bag of tuples, each with the
 * specified number of fields.<br>
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: if count=2, then { (fld1, fld2), (fld3, fld4), ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.junit.Before;
import org.junit.Test
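The chunking rule ToBag implements (the first field gives the tuple width; the remaining fields are packed into consecutive tuples of that width) can be sketched without any Pig dependencies. The class and method names below are made up for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of ToBag's chunking logic: given a chunk size 'count'
// and a sequence of fields, group the fields into sub-lists ("tuples")
// of 'count' elements each.
public class ToBagSketch {
    static List<List<Object>> toBag(int count, Object... fields) {
        List<List<Object>> bag = new ArrayList<>();
        List<Object> tuple = null;
        for (int i = 0; i < fields.length; i++) {
            if (i % count == 0) {        // start a new tuple every 'count' fields
                tuple = new ArrayList<>();
                bag.add(tuple);
            }
            tuple.add(fields[i]);
        }
        return bag;
    }

    public static void main(String[] args) {
        // count=2 over four fields yields two 2-field tuples
        System.out.println(toBag(2, "a", 1, "b", 2)); // [[a, 1], [b, 2]]
    }
}
```

Fields left over when the input length is not a multiple of the count simply end up in a shorter final tuple, same as in the UDF above.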
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Status: Patch Available (was: Open) UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Fix For: 0.8.0 Attachments: PIG-1386-trunk.patch
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch da673ab2d584faf903e8b49b63a03ade spell check the documentation UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Fix For: 0.8.0 Attachments: PIG-1386-trunk.patch
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: (was: PIG-1386-trunk.patch) UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Fix For: 0.8.0 Attachments: PIG-1386-trunk.patch
how to compare?
Guys, I'm implementing that ExtremalTupleByNthField and I have a question about comparison. Once I have parsed out the two objects I want to compare, how do I perform the comparison? My current implementation assumes the data is Comparable (which it invariably is within Pig), so I do:

int c = ((Comparable)o1).compareTo((Comparable)o2);

Now I also see that there's another approach that compares the two objects via:

int c = DataType.compare(o1, o2, DataType.findType(o1), DataType.findType(o2));

The first method works for all types I've tried (int, string, etc.), but the latter is what another UDF already in SVN uses. What are your suggestions? (PIG-1386 is the ticket where I've checked in the patch.)
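For illustration, here is a minimal plain-Java sketch of the Comparable-cast approach described above; no Pig classes are used, and ExtremalSketch / maxByNthField are hypothetical names. The raw cast is safe whenever all values in the chosen column share a comparable runtime type (as in a well-typed Pig column); when the two sides have different runtime types it throws ClassCastException, which is the case the type-aware DataType.compare is built to handle:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: pick the "extremal" row by its n-th field, assuming
// every value in that column is mutually Comparable.
public class ExtremalSketch {
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Object[] maxByNthField(List<Object[]> rows, int n) {
        Object[] best = null;
        for (Object[] row : rows) {
            // Raw Comparable cast, as in the implementation described above.
            if (best == null || ((Comparable) row[n]).compareTo(best[n]) > 0) {
                best = row;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
                new Object[]{"x", 10},
                new Object[]{"y", 42},
                new Object[]{"z", 7});
        System.out.println(maxByNthField(rows, 1)[0]); // y
    }
}
```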
[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
[ https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861248#action_12861248 ] hc busy commented on PIG-1303: -- +(google^2) that worked! unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor Key: PIG-1303 URL: https://issues.apache.org/jira/browse/PIG-1303 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Environment: pig 0.6.0 on a Fedora Linux machine, JDK 1.6 u11 Reporter: Johannes Rußek Assignee: Dmitriy V. Ryaboy Fix For: 0.7.0, 0.8.0 Attachments: PIG-1303.patch, TypeCheckingVisitor.java.diff I'm unable to set the format of the outgoing date string in the constructor as it's supposed to work. The only way I could change the format was to change the default in the Java class and rebuild piggybank. Apparently this has something to do with the way Pig instantiates DateExtractor; quoting a replier on the mailing list, David Vrensk said: I ran into the same problem a couple of weeks ago, and played around with the code inserting some print/log statements. It turns out that the arguments are only used in the initial constructor calls, when the Pig process is starting, but once Pig reaches the point where it would use the UDF, it creates new DateExtractors without passing the arguments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
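The behavior David Vrensk describes (constructor arguments honored only when the script is first parsed, after which the UDF is re-created via its no-arg constructor) can be illustrated with a standalone sketch. DateExtractorSketch and its default pattern are hypothetical stand-ins, not the real piggybank code:

```java
import java.text.SimpleDateFormat;

// Sketch of the pattern at issue: a UDF-style class whose no-arg
// constructor falls back to a default format. If the framework later
// re-instantiates the class via the no-arg constructor (as described in
// PIG-1303), any format passed to the arg-taking constructor is lost.
public class DateExtractorSketch {
    private final SimpleDateFormat outgoingFormat;

    public DateExtractorSketch() {
        this("yyyy-MM-dd");   // default silently restored on re-instantiation
    }

    public DateExtractorSketch(String outgoingPattern) {
        this.outgoingFormat = new SimpleDateFormat(outgoingPattern);
    }

    public String pattern() {
        return outgoingFormat.toPattern();
    }

    public static void main(String[] args) {
        // What the user configured...
        System.out.println(new DateExtractorSketch("dd/MM/yyyy").pattern()); // dd/MM/yyyy
        // ...versus what a fresh no-arg instance actually uses.
        System.out.println(new DateExtractorSketch().pattern());             // yyyy-MM-dd
    }
}
```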
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: (was: PIG-1386-trunk.patch) UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch 1873fb8d75f7362df343615f623a7390 Added documentation, added a bunch of unit tests to test the functionalities that the documentation claims to have. Cleaned up to revert to not requiring a change to EvalFunc's constructor. Added ASF license text. UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch
[jira] Commented: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860488#action_12860488 ] hc busy commented on PIG-1385: -- yeah! my first contrib. Thanks, Alan!! UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Fix For: 0.8.0 Attachments: PIG-1385-trunk.patch Original Estimate: 24h Remaining Estimate: 24h

The issue description quotes the same conversation and ToBag code as PIG-1386 above, plus Andrey's accompanying unit test:

import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.junit.Before;
import org.junit.Test;

import static org.junit.Assert.assertTrue;

/**
 * @author astepachev
 */
public class ToBagTest {
    PigServer pigServer;
    URL inputTxt;

    @Before
    public void init() throws IOException, URISyntaxException {
        pigServer = new PigServer(ExecType.LOCAL);
        inputTxt = this.getClass().getResource("bagTest.txt").toURI().toURL();
    }

    @Test
    public void testSimple() throws IOException {
        pigServer.registerQuery("a = load '" + inputTxt.toExternalForm
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch 163812d67299dd4b44470c854c80f2a8 redo without the addition of the helper function. UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch Based on this conversation: totally, go for it, it'd be pretty straightforward to add this functionality. - Hide quoted text - On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote: Hey, while we're on the subject, and I have your attention, can we re-factor the UDF MaxTupleByFirstField to take constructor? *define customMaxTuple ExtremalTupleByNthField(n, 'min');* *G = group T by id;* *M = foreach T generate customMaxTuple(T); * Where n is the nth field, and the second parameter allows us to specify min, max, median, etc... Does this seem like something useful to everyone? On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. 
However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for, and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Sometimes I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check in any Java code into that package. Any comment about where to put this kind of utility class? On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDFs in piggybank for these: toBag(); toTuple(); -- which is kinda like exec(Tuple in){return in;} TupleToBag(); -- sometimes you need it this way for some reason. OK, I place my current code here; maybe later I'll make a patch (if such an implementation is acceptable, of course).

{code}
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import java.io.IOException;

/**
 * Convert any sequence of fields to a bag of tuples with the specified
 * count of fields each.<br>
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: count=2, then { (fld1, fld2), (fld3, fld4) ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}
{code}

import org.apache.pig.ExecType; import org.apache.pig.PigServer; import org.junit.Before; import org.junit.Test; import
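The exec loop above starts a fresh tuple every `count` trailing fields. A simplified, standalone sketch of that chunking logic, using plain Java collections instead of Pig's Tuple/DataBag (class and method names are illustrative, not Pig's API):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
    // Split `fields` into sublists of `count` elements each, mirroring
    // how ToBag packs the fields after the leading count into tuples.
    public static <T> List<List<T>> chunk(List<T> fields, int count) {
        List<List<T>> bag = new ArrayList<>();
        List<T> tuple = null;
        for (int i = 0; i < fields.size(); i++) {
            if (i % count == 0) {      // start a new tuple every `count` fields
                tuple = new ArrayList<>();
                bag.add(tuple);
            }
            tuple.add(fields.get(i));
        }
        return bag;
    }
}
```

With count=2, the four fields (a, b, c, d) become the bag { (a, b), (c, d) }, matching the javadoc's example.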
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: (was: PIG-1386-trunk.patch)
[jira] Commented: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860490#action_12860490 ] hc busy commented on PIG-1386: -- Okay, here's an alternative. What if we did this instead:

{code}
class EvalFunc {
    ...
    protected String parameters = "";

    public EvalFunc(Object... constructorParameters) {
        StringBuilder sb = new StringBuilder();
        if (constructorParameters != null && constructorParameters.length > 0) {
            for (Object o : constructorParameters) {
                sb.append(',');
                sb.append('\'');
                sb.append(o.toString());
                sb.append('\'');
            }
            parameters = "(" + sb.substring(1) + ")";
        }
        ... // rest of the EvalFunc constructor
    }
}
{code}

and my getInitial is implemented like this:

{code}
@Override
public String getInitial() {
    return HelperClass.class.getName() + parameters;
}
{code}
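The proposal above captures constructor varargs as a quoted parameter string that can be appended to a helper class name for re-instantiation. A minimal standalone version of just that string-building step (hypothetical class name; not Pig's actual EvalFunc code):

```java
public class ParamString {
    // Build the "('a','b')" suffix from constructor varargs, as in the
    // EvalFunc sketch quoted in the comment above.
    public static String build(Object... args) {
        if (args == null || args.length == 0) {
            return "";                       // no-arg UDF: bare class name suffices
        }
        StringBuilder sb = new StringBuilder();
        for (Object o : args) {
            sb.append(',').append('\'').append(o).append('\'');
        }
        // Drop the leading comma and wrap in parentheses.
        return "(" + sb.substring(1) + ")";
    }
}
```

For example, build("1", "max") yields the suffix ('1','max'), so HelperClass.class.getName() + suffix names a parameterized instantiation.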
[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
[ https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860297#action_12860297 ] hc busy commented on PIG-1303: -- But the problem is that inside the EvalFunc constructor, in the case of Algebraic classes, it constructs each of Initial, Intermediate, and Final, which are EvalFuncs that, in my case, require a parameter to operate correctly. If I declare the helper class that represents the initial/intermediate/final as

{code}
public class HelperClass extends EvalFunc<Tuple> {
    public HelperClass() {
        super();
    }

    public Tuple exec(Tuple input) throws IOException {
        return extreme(fieldIndex, sign, input, reporter);
    }
}
{code}

where the fieldIndex and sign come from the surrounding class (note the class is not static), then the code crashes. It's not able to construct the HelperClass, with this error:

{quote} could not instantiate 'org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField$HelperClass' with arguments 'null' java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField$HelperClass' with arguments 'null' at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:498) at org.apache.pig.EvalFunc.getReturnTypeFromSpec(EvalFunc.java:136) at org.apache.pig.EvalFunc.init(EvalFunc.java:123) at org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField.init(ExtremalTupleByNthField.java:77) at org.apache.pig.piggybank.evaluation.TestExtremalTupleByNthField.testMin(Unknown Source) Caused by: java.lang.InstantiationException: org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField$HelperClass at java.lang.Class.newInstance0(Class.java:340) at java.lang.Class.newInstance(Class.java:308) at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:468) {quote}

Basically, I think it's not able to construct it because the class can only be constructed from an instance of ExtremalTupleByNthField:
{code}
ExtremalTupleByNthField etbnf = new ExtremalTupleByNthField("1", "max");
etbnf.new HelperClass();
{code}

So my solution to this problem was to make this class static, but make it so that EvalFunc can take a vararg that will eventually contain the actual parameters. The handleChildConstructorParameters method in EvalFunc will construct a string that represents the call into the initial/intermediate/final methods, but it contains parameters that came from the ExtremalTupleByNthField. unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor Key: PIG-1303 URL: https://issues.apache.org/jira/browse/PIG-1303 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Environment: pig 0.6.0 on a fedora linux machine, jdk 1.6 u11 Reporter: Johannes Rußek Assignee: Dmitriy V. Ryaboy Attachments: TypeCheckingVisitor.java.diff I'm unable to set the format of the outgoing date string in the constructor as it's supposed to work. The only way I could change the format was to change the default in the Java class and rebuild piggybank. Apparently this has something to do with the way Pig instantiates DateExtractor; quoting a reply on the mailing list: David Vrensk said: I ran into the same problem a couple of weeks ago, and played around with the code, inserting some print/log statements. It turns out that the arguments are only used in the initial constructor calls, when the Pig process is starting, but once Pig reaches the point where it would use the UDF, it creates new DateExtractors without passing the arguments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
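The InstantiationException in the stack trace above is the standard JVM behavior for non-static inner classes: every one of their constructors implicitly takes the enclosing instance as a parameter, so reflective instantiation through a no-arg constructor fails. A minimal sketch of that difference, with illustrative names rather than Pig's classes:

```java
public class InnerClassDemo {
    public class NonStaticHelper {}        // implicit ctor takes an InnerClassDemo
    public static class StaticHelper {}    // standalone public no-arg ctor

    // Attempt no-arg reflective construction, roughly what
    // PigContext.instantiateFuncFromSpec does for a UDF class name.
    public static boolean canInstantiate(Class<?> c) {
        try {
            c.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;  // no no-arg constructor without the outer instance
        }
    }
}
```

This is why the helper class had to become static: only the static nested class can be named and instantiated by Pig's reflection without an ExtremalTupleByNthField instance in hand.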
[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
[ https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860310#action_12860310 ] hc busy commented on PIG-1303: -- Hmm, okay, so let me shorten my problem. Basically, the functions getInitial, getIntermed, and getFinal in my Algebraic class don't have access to the constructor parameters. The reason is this: in Java, the super() constructor can only be called as the very first thing the deriving class's constructor does, so my UDFs have constructors that look like this:

{code}
public ExtremalTupleByNthField(String fieldIndexString, String order) {
    super();
    parameters = "(" + fieldIndexString + "," + order + ")";
}

@Override
public String getInitial() {
    return HelperClass.class.getName() + parameters;
}
{code}

But the problem is that the EvalFunc() constructor calls the child class's getInitial() to type check. When it does this, it finds that my getInitial() returns something incomplete, because the parameters member variable hasn't been initialized yet. This is a pretty mundane problem with Java programs, and the way to fix it is what I've submitted in the patch: calling an overridden method in the super()'s constructor. I mean, I don't see any other way to do this, but I'd be willing to work on another implementation if you can suggest one?
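The initialization-order problem described in the comment above is easy to reproduce outside Pig: a superclass constructor that calls an overridden method runs before the subclass's field initializers, so the override observes the field's default value. A minimal sketch with illustrative names:

```java
class Base {
    final String observed;

    Base() {
        observed = describe();  // virtual dispatch during construction
    }

    String describe() { return "base"; }
}

class Derived extends Base {
    // Not yet assigned when Base() runs -- describe() sees null there.
    String parameters = "('1','max')";

    @Override
    String describe() { return "HelperClass" + parameters; }
}
```

After construction completes, describe() returns "HelperClass('1','max')", but the value captured inside the Base constructor is "HelperClassnull" -- exactly the incomplete getInitial() result the comment describes.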
[jira] Commented: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860325#action_12860325 ] hc busy commented on PIG-1386: -- oops, posted to the wrong ticket: {quote} Hmm, okay, so let me shorten my problem. Basically, the functions getInitial(), getIntermed(), and getFinal() in my Algebraic class don't have access to the constructor parameters. The reason is this: in Java, the super() constructor can only be called as the very first thing the deriving class's constructor does, so my UDFs have constructors that look like this:

{code}
public ExtremalTupleByNthField(String fieldIndexString, String order) {
    super();
    parameters = "(" + fieldIndexString + "," + order + ")";
}

@Override
public String getInitial() {
    return HelperClass.class.getName() + parameters;
}
{code}

But the problem is that the EvalFunc() constructor initializes the EvalFunc returned by getInitial() to type check. When it does this, it finds that my getInitial() returns something incomplete, because the parameters member variable hasn't been initialized yet. This is a pretty mundane problem with Java programs, and the way to fix it is what I've submitted in the patch: calling an overridden method in the super()'s constructor. I mean, I don't see any other way to do this, but I'd be willing to work on another implementation if you can suggest one? {quote}
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: (was: PIG-1386-trunk.patch)
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch 25ce97367cadfd2ea4be379c6f5c351d Clean up documentation and refactor to unify parsing of constructor arguments in the two classes.
[jira] Commented: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860365#action_12860365 ] hc busy commented on PIG-1385: -- ok, ok, moving tests to evaluation.util requires that you import the classes under test. Here we usually have tests in the same package (but sitting under test/ instead of src/) so we can test package protected methods. Also so we don't have to import the CUT. But other than that, I guess I should follow convention. I agree with these changes. UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1385-trunk.patch Original Estimate: 24h Remaining Estimate: 24h Based on this conversation: On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. 
On Apr 19, 2010, at 12:53 PM, hc busy wrote: Some times I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDF's in piggybank for these: toBag() toTuple(); --which is kinda like exec(Tuple in){return in;} TupleToBag(); --some times you need it this way for some reason. Ok. I place my current code here, may be later I make a patch (if such implementation is acceptable of course).
{code}
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import java.io.IOException;

/**
 * Convert any sequence of fields to a bag of tuples with the specified count of fields.
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: count=2, then { (fld1, fld2), (fld3, fld4) ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}
{code}
{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import static org.junit.Assert.assertTrue;

/**
 * @author astepachev
 */
public class ToBagTest {
    PigServer pigServer;
    URL inputTxt;

    @Before
    public void init() throws IOException, URISyntaxException
{code}
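The count-based chunking at the heart of the ToBag UDF above can be sketched in plain Java, independent of Pig's Tuple/DataBag types. This is an illustrative stand-in (class and method names here are hypothetical, not part of the piggybank API): the first element is the chunk size, and a new chunk is started every `counter` elements.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of ToBag's grouping loop: element 0 is the chunk
 *  size; the remaining elements are packed into chunks of that size. */
public class ToBagSketch {
    public static List<List<Object>> chunk(List<Object> input) {
        Integer counter = (Integer) input.get(0);   // leading "count" field
        List<List<Object>> bag = new ArrayList<>();
        List<Object> tuple = new ArrayList<>();
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {                 // start a new "tuple"
                tuple = new ArrayList<>();
                bag.add(tuple);
            }
            tuple.add(input.get(i + 1));            // +1 skips the count field
        }
        return bag;
    }

    public static void main(String[] args) {
        // count=2 followed by five fields -> a short final chunk, as in the UDF
        System.out.println(chunk(List.of(2, "a", "b", "c", "d", "e")));
        // prints [[a, b], [c, d], [e]]
    }
}
```

Note that, like the UDF, this leaves a trailing partial chunk when the field count is not a multiple of `counter`.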
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: (was: PIG-1386-trunk.patch) UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch Based on this conversation: totally, go for it, it'd be pretty straightforward to add this functionality. - Hide quoted text - On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote: Hey, while we're on the subject, and I have your attention, can we re-factor the UDF MaxTupleByFirstField to take constructor? *define customMaxTuple ExtremalTupleByNthField(n, 'min');* *G = group T by id;* *M = foreach T generate customMaxTuple(T); * Where n is the nth field, and the second parameter allows us to specify min, max, median, etc... Does this seem like something useful to everyone? On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. 
These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan.
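The ExtremalTupleByNthField(n, 'min') behavior proposed in this thread — scan a grouped bag and keep the tuple whose nth field is extremal — can be sketched with a plain Java Comparator over List rows. The names and signature here are illustrative stand-ins, not the actual piggybank UDF:

```java
import java.util.Comparator;
import java.util.List;

/** Sketch of the proposed ExtremalTupleByNthField(n, 'min'|'max'):
 *  return the row whose nth field is smallest or largest. */
public class ExtremalByNthField {
    @SuppressWarnings("unchecked")
    public static List<Object> extremal(List<List<Object>> bag, int n, String mode) {
        // compare rows by their nth field, assumed Comparable (as Pig fields are)
        Comparator<List<Object>> byNth =
            Comparator.comparing(row -> (Comparable<Object>) row.get(n));
        return "min".equals(mode)
            ? bag.stream().min(byNth).orElseThrow()
            : bag.stream().max(byNth).orElseThrow();
    }

    public static void main(String[] args) {
        List<List<Object>> bag = List.of(
            List.of("x", 3), List.of("y", 1), List.of("z", 2));
        System.out.println(extremal(bag, 1, "min")); // row with smallest field 1
        System.out.println(extremal(bag, 1, "max")); // row with largest field 1
    }
}
```

Medians, unlike min/max, are not algebraic (they need the whole sorted bag), which is one reason the UDF discussion centers on min/max.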
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch checked to be sure the unittest builds and runs. UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch
[jira] Updated: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1385: - Attachment: (was: PIG-1385-trunk.patch) UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Original Estimate: 24h Remaining Estimate: 24h
[jira] Updated: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1385: - Status: Open (was: Patch Available) UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1385-trunk.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Updated: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1385: - Attachment: PIG-1385-trunk.patch changed so that the unit test builds and runs. UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1385-trunk.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Status: Open (was: Patch Available) UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Status: Patch Available (was: Open) resubmitting patch for the build system to check. UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch
[jira] Commented: (PIG-1303) unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
[ https://issues.apache.org/jira/browse/PIG-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860103#action_12860103 ] hc busy commented on PIG-1303: -- Okay, so, here's a thought: I'm kind of stuck writing the Initial/Intermed/Final methods for an algebraic EvalFunc that has constructor parameters, because I couldn't pass the parameters in. A suggestion is to do this (without being incompatible with previous versions): alter EvalFunc's profile so that
{code}
public abstract class EvalFunc<T> {
    protected void handleChildConstructorParameters(Object... childConstructor) {
        // by default do nothing.
    }
    public EvalFunc(Object... constructorParameters) {
        handleChildConstructorParameters(constructorParameters);
        // ... then do everything else it used to do.
    }
}
{code}
The reason this is necessary is that I'll need to override handleChildConstructorParameters in my algebraic EvalFunc to do some things before the rest of EvalFunc()'s constructor continues. This will help fix this date format problem for algebraic EvalFuncs. unable to set outgoing format for org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor Key: PIG-1303 URL: https://issues.apache.org/jira/browse/PIG-1303 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Environment: pig 0.6.0 on a fedora linux machine, jdk 1.6 u11 Reporter: Johannes Rußek Assignee: Dmitriy V. Ryaboy Attachments: TypeCheckingVisitor.java.diff I'm unable to set the format of the outgoing date string in the constructor as it's supposed to work. The only way I could change the format was to change the default in the java class and rebuild piggybank. Apparently this has something to do with the way pig instantiates DateExtractor; quoting a replier on the mailing list: David Vrensk said: I ran into the same problem a couple of weeks ago, and played around with the code inserting some print/log statements.
It turns out that the arguments are only used in the initial constructor calls, when the pig process is starting, but once pig reaches the point where it would use the udf, it creates new DateExtractors without passing the arguments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
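The behavior David describes can be illustrated without Pig at all: when an object is re-created from just its class name, only the no-arg constructor runs, so anything passed to the parameterized constructor on the front end is lost on the backend. This is a minimal self-contained sketch; FakeExtractor, its `format` field, and the default value are hypothetical stand-ins, not Pig's actual DateExtractor:

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for a parameterized UDF such as DateExtractor.
class FakeExtractor {
    final String format;

    FakeExtractor() { this.format = "yyyy/MM"; }            // hard-coded default
    FakeExtractor(String format) { this.format = format; }  // what the script author sets

    // Re-create an instance from a class name alone, the way a backend that
    // only remembers the class name would: the no-arg constructor is used,
    // so the original constructor arguments are dropped.
    static FakeExtractor instantiateByName(String className) {
        try {
            Constructor<?> c = Class.forName(className).getDeclaredConstructor();
            return (FakeExtractor) c.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

public class ConstructorLossDemo {
    public static void main(String[] args) {
        FakeExtractor frontend = new FakeExtractor("dd-MMM-yyyy"); // args honored here
        FakeExtractor backend = FakeExtractor.instantiateByName("FakeExtractor");
        System.out.println(frontend.format); // dd-MMM-yyyy
        System.out.println(backend.format);  // yyyy/MM -- the argument was lost
    }
}
```

This is why rebuilding piggybank with a different default "works" while the constructor argument does not: the backend instance never sees the argument.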
[jira] Commented: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860104#action_12860104 ] hc busy commented on PIG-1386: -- I'm having trouble writing this UDF because of a bug similar to PIG-1303; here's my comment on that ticket, below. It seems that doing this would allow me to pass on the constructor parameters: {quote} Okay, so, here's a thought: I'm kind of stuck writing the Initial/Intermed/Final methods for an algebraic EvalFunc that has constructor parameters, because I couldn't pass the parameters in. A suggestion is to do this (without being incompatible with previous versions): alter EvalFunc's profile so that
{code}
public abstract class EvalFunc<T> {
    protected void handleChildConstructorParameters(Object... childConstructor) {
        // by default do nothing.
    }
    public EvalFunc(Object... constructorParameters) {
        handleChildConstructorParameters(constructorParameters);
        // ... then do everything else it used to do.
    }
}
{code}
The reason this is necessary is that I'll need to override handleChildConstructorParameters in my algebraic EvalFunc to do some things before the rest of EvalFunc()'s constructor continues. This will help fix this date format problem for algebraic EvalFuncs. {quote} UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch Based on this conversation: totally, go for it, it'd be pretty straightforward to add this functionality. - Hide quoted text - On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote: Hey, while we're on the subject, and I have your attention, can we re-factor the UDF MaxTupleByFirstField to take a constructor?
*define customMaxTuple ExtremalTupleByNthField(n, 'min');* *G = group T by id;* *M = foreach G generate customMaxTuple(T);* Where n is the nth field, and the second parameter allows us to specify min, max, median, etc... Does this seem like something useful to everyone? On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there a reason not to do the second or third, other than being more complicated? Certainly I'd volunteer to put the top implementation into the util package and submit them for builtins, but the latter syntactic candies seem more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Sometimes I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly...
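A minimal sketch of the comparator idea behind the proposed ExtremalTupleByNthField, independent of Pig's Tuple API. The class ExtremalByNthField and its `extremal` method are hypothetical (tuples are modeled as plain lists); it only shows how a single nth-field comparator can serve both the 'min' and 'max' modes:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pick the tuple whose nth field is extremal
// under the ordering chosen by the second constructor-style parameter.
public class ExtremalByNthField {
    @SuppressWarnings("unchecked")
    static List<Object> extremal(List<List<Object>> bag, int n, String mode) {
        // One comparator on the nth field drives both modes.
        Comparator<List<Object>> byNth =
            Comparator.comparing((List<Object> t) -> (Comparable<Object>) t.get(n));
        return "min".equals(mode)
            ? bag.stream().min(byNth).orElse(null)
            : bag.stream().max(byNth).orElse(null);
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.asList((Object) "x", 3),
            Arrays.asList((Object) "y", 1),
            Arrays.asList((Object) "z", 2));
        System.out.println(extremal(bag, 1, "min")); // [y, 1]
        System.out.println(extremal(bag, 1, "max")); // [x, 3]
    }
}
```

A real Pig UDF would additionally implement Algebraic so the extremal pick happens in combiners, which is exactly where the constructor-parameter problem above bites.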
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch a92218b0c641363439af8f2d9e5ecbc0 UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: (was: PIG-1386-trunk.patch) UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Assignee: hc busy Attachments: PIG-1386-trunk.patch
[jira] Updated: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1385: - Affects Version/s: 0.6.0 Description: Based on this conversation: On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Some times I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belong), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly... 
Maybe we should have some UDFs in piggybank for these: toBag(); toTuple(); --which is kinda like exec(Tuple in){return in;} TupleToBag(); --sometimes you need it this way for some reason. Ok. I'll place my current code here; maybe later I'll make a patch (if such an implementation is acceptable, of course).

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import java.io.IOException;

/**
 * Convert any sequence of fields to a bag with the specified count of fields.
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import static org.junit.Assert.assertTrue;

/**
 * @author astepachev
 */
public class ToBagTest {
    PigServer pigServer;
    URL inputTxt;

    @Before
    public void init() throws IOException, URISyntaxException {
        pigServer = new PigServer(ExecType.LOCAL);
        inputTxt = this.getClass().getResource("bagTest.txt").toURI().toURL();
    }

    @Test
    public void testSimple() throws IOException {
        pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() + "' "
            + "using PigStorage(',') "
            + "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray);");
        pigServer.registerQuery("last = foreach a generate flatten("
            + ToBag.class.getName() + "(2, id, a, id, b, id, c));");
        pigServer.deleteFile("target/pigtest/func1.txt");
        pigServer.store("last", "target/pigtest/func1.txt");
        assertTrue(pigServer.fileSize("target/pigtest/func1.txt") > 0);
    }
}

was: Based on this conversation: totally, go for it, it'd be pretty straightforward to add this functionality. - Hide quoted text - On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote: Hey, while we're
[jira] Created: (PIG-1387) Syntactical Sugar for PIG-1385
Syntactical Sugar for PIG-1385 -- Key: PIG-1387 URL: https://issues.apache.org/jira/browse/PIG-1387 Project: Pig Issue Type: Wish Components: grunt Affects Versions: 0.6.0 Reporter: hc busy From this conversation: extend PIG-1385 so that, instead of calling a UDF, built-in behavior is used when the (), {}, [] groupings are encountered. What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there a reason not to do the second or third, other than being more complicated? Certainly I'd volunteer to put the top implementation into the util package and submit them for builtins, but the latter syntactic candies seem more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Sometimes I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly...
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Status: Patch Available (was: Open) Here's a first stab. UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy
[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField
[ https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1386: - Attachment: PIG-1386-trunk.patch The patch UDF to extend functionalities of MaxTupleBy1stField --- Key: PIG-1386 URL: https://issues.apache.org/jira/browse/PIG-1386 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Attachments: PIG-1386-trunk.patch Based on this conversation: totally, go for it, it'd be pretty straightforward to add this functionality. - Hide quoted text - On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote: Hey, while we're on the subject, and I have your attention, can we re-factor the UDF MaxTupleByFirstField to take constructor? *define customMaxTuple ExtremalTupleByNthField(n, 'min');* *G = group T by id;* *M = foreach T generate customMaxTuple(T); * Where n is the nth field, and the second parameter allows us to specify min, max, median, etc... Does this seem like something useful to everyone? On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. 
These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Sometimes I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belongs), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDF's in piggybank for these: toBag(); toTuple(); --which is kinda like exec(Tuple in){return in;} TupleToBag(); --sometimes you need it this way for some reason. Ok, I place my current code here; maybe later I'll make a patch (if such an implementation is acceptable, of course).

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

import java.io.IOException;

/**
 * Convert any sequence of fields to a bag of tuples with the specified number of fields.
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: for count=2, { (fld1, fld2), (fld3, fld4), ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        // skip field 0 (the count); start a fresh tuple every 'counter' fields
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}

import org.apache.pig.ExecType; import org.apache.pig.PigServer; import org.junit.Before; import org.junit.Test; import java.io.IOException; import java.net.URISyntaxException; import java.net.URL; import static
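For concreteness, here is a small Python model (purely illustrative, not part of the patch) of the grouping logic the UDF above implements: field 0 is the chunk size, and the remaining fields are packed into tuples of that size, with a trailing partial tuple kept as-is.

```python
def to_bag(fields, count):
    """Model of the ToBag UDF: pack 'fields' into tuples of size 'count'.

    Mirrors the Java loop: a null count yields null, and a trailing
    partial chunk is still emitted.
    """
    if count is None:
        return None
    bag, tup = [], None
    for i, value in enumerate(fields):
        if i % count == 0:      # start a fresh tuple every 'count' fields
            tup = []
            bag.append(tup)
        tup.append(value)
    return [tuple(t) for t in bag]

# count=2 groups (fld1, fld2), (fld3, fld4), matching the javadoc example
print(to_bag(["a", "b", "c", "d"], 2))  # → [('a', 'b'), ('c', 'd')]
```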
[jira] Updated: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1385: - Attachment: PIG-1385-trunk.patch UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Attachments: PIG-1385-trunk.patch Original Estimate: 24h Remaining Estimate: 24h Based on this conversation: On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote: What about making them part of the language using symbols? instead of foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7; have language support foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7; or even: foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11; Is there reason not to do the second or third other than being more complicated? Certainly I'd volunteer to put the top implementation in to the util package and submit them for builtin's, but the latter syntactic candies seems more natural.. On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote: The grouping package in piggybank is left over from back when Pig allowed users to define grouping functions (0.1). Functions like these should go in evaluation.util. However, I'd consider putting these in builtin (in main Pig) instead. These are things everyone asks for and they seem like a reasonable addition to the core engine. This will be more of a burden to write (as we'll hold them to a higher standard) but of more use to people as well. Alan. On Apr 19, 2010, at 12:53 PM, hc busy wrote: Some times I wonder... I mean, somebody went to the trouble of making a path called org.apache.pig.piggybank.grouping (where it seems like this code belong), but didn't check in any java code into that package. Any comment about where to put this kind of utility classes? 
On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote: 2010/4/19 hc busy hc.b...@gmail.com That's just the way it is right now, you can't make bags or tuples directly... Maybe we should have some UDF's in piggybank for these: toBag(); toTuple(); --which is kinda like exec(Tuple in){return in;} TupleToBag(); --sometimes you need it this way for some reason. Ok, I place my current code here; maybe later I'll make a patch (if such an implementation is acceptable, of course).

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

import java.io.IOException;

/**
 * Convert any sequence of fields to a bag of tuples with the specified number of fields.
 * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
 * Output: for count=2, { (fld1, fld2), (fld3, fld4), ... }
 *
 * @author astepachev
 */
public class ToBag extends EvalFunc<DataBag> {
    public BagFactory bagFactory;
    public TupleFactory tupleFactory;

    public ToBag() {
        bagFactory = BagFactory.getInstance();
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.isNull())
            return null;
        final DataBag bag = bagFactory.newDefaultBag();
        final Integer counter = (Integer) input.get(0);
        if (counter == null)
            return null;
        Tuple tuple = tupleFactory.newTuple();
        // skip field 0 (the count); start a fresh tuple every 'counter' fields
        for (int i = 0; i < input.size() - 1; i++) {
            if (i % counter == 0) {
                tuple = tupleFactory.newTuple();
                bag.add(tuple);
            }
            tuple.append(input.get(i + 1));
        }
        return bag;
    }
}

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;

import static org.junit.Assert.assertTrue;

/**
 * @author astepachev
 */
public class ToBagTest {
    PigServer pigServer;
    URL inputTxt;

    @Before
    public void init() throws IOException, URISyntaxException {
        pigServer = new PigServer(ExecType.LOCAL);
        inputTxt = this.getClass().getResource("bagTest.txt").toURI().toURL();
    }

    @Test
    public void testSimple() throws IOException {
        pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() + "' using PigStorage(',')" + " as (id:int, a:chararray, b:chararray, c:chararray, d:chararray
[jira] Updated: (PIG-1385) UDF to create tuples and bags
[ https://issues.apache.org/jira/browse/PIG-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1385: - Status: Patch Available (was: Open) UDF to create tuples and bags - Key: PIG-1385 URL: https://issues.apache.org/jira/browse/PIG-1385 Project: Pig Issue Type: New Feature Components: tools Affects Versions: 0.6.0 Reporter: hc busy Attachments: PIG-1385-trunk.patch Original Estimate: 24h Remaining Estimate: 24h
Re: incorrect Inner Join result for multi column join with null values in join key
Cool! Can't wait until CDH has 0.7... Kinda surprised that nobody encountered this problem before... Can I file a ticket? On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates ga...@yahoo-inc.com wrote: On Apr 16, 2010, at 9:37 AM, hc busy wrote: What Scott noticed is present when a multiple-column join key is used in a distributed setting. The trap is that you unit test the behavior/Pig script and it does the join right in a local environment, and then you get burned after you deploy to production in a distributed environment. In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully that will fix these issues with development and deployment differences. Alan. On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey sc...@richrelevance.com wrote: CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2 weeks ago. Joins on multiple columns have null key values matching. IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int); IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int); --- both the above are the same dump IN; (,1,2,3) (,-5,5,5) ( ,100,200,300) ( ,0,200,300) (a,4,5,6) (a,7,8,9) (b,10,11,12) (b,11,11,12) IN_NULLS = FILTER IN BY ind is NULL; dump IN_NULLS; (,1,2,3) (,-5,5,5) J1 = JOIN IN by (ind), IN2 by (ind); dump J1; ( ,0,200,300, ,0,200,300) (a,4,5,6,a,4,5,6) (a,4,5,6,a,7,8,9) (a,7,8,9,a,4,5,6) (a,7,8,9,a,7,8,9) ( ,100,200,300, ,100,200,300) (b,10,11,12,b,10,11,12) (b,10,11,12,b,11,11,12) (b,11,11,12,b,10,11,12) (b,11,11,12,b,11,11,12) The above is the expected result of the self-join on the first column. J2 = JOIN IN by (ind, ts), IN2 by (ind, ts); dump J2; ( ,0,200,300, ,0,200,300) ( ,100,200,300, ,100,200,300) (a,4,5,6,a,4,5,6) (a,7,8,9,a,7,8,9) (b,10,11,12,b,10,11,12) (b,11,11,12,b,11,11,12) (,-5,5,5,,-5,5,5) (,1,2,3,,1,2,3) The above is incorrect, since it matched the rows that have NULL for the ind field.
There is a work-around, by explicitly filtering for null on the join columns before the join, but the above still looks incorrect to me. I suspect it is fixed in 0.6 or later, but I have not been able to find a JIRA ticket or message on this list about this.
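For reference, SQL-style join semantics say a null key matches nothing, not even another null, which is why the J2 output above is wrong. Here is a minimal Python sketch of the expected behavior (purely illustrative; the function and variable names are made up, this is not Pig code):

```python
def inner_join(left, right, key_cols):
    """Inner join with SQL NULL semantics: a null in any key column
    never matches, not even another null."""
    out = []
    for l in left:
        lk = tuple(l[c] for c in key_cols)
        if any(v is None for v in lk):
            continue  # rows with null key fields never join
        for r in right:
            if lk == tuple(r[c] for c in key_cols):
                out.append(l + r)
    return out

rows = [(None, 1), ("a", 4), ("a", 7)]
# the (None, 1) row drops out of the self-join, unlike in the buggy J2
print(inner_join(rows, rows, [0, 1]))  # → [('a', 4, 'a', 4), ('a', 7, 'a', 7)]
```

This is the same effect as the workaround of filtering nulls on the join columns before the JOIN.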
Re: Begin a discussion about Pig as a top level project
The Twitter office is cushier and has more bars within stumbling distance. Just sayin'. and strip clubs too, I gather there're a couple on Market... near the Civic Center BART stop ;-) oh... hey, you guys are at a nice place... lots of nightclubs near there too. Given that, do you think it makes sense to say that Pig stays a subproject for now, but if it someday grows beyond Hadoop only it becomes a TLP? I could agree to that stance. Oops, I didn't read your whole message... I think TLP could be part of the roadmap: Planned publicity, like planned pregnancy, is a good thing. And on the way there, we should add a dedicated resource that updates documentation and links on the website... :-) On Mon, Apr 5, 2010 at 12:10 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: The Twitter office is cushier and has more bars within stumbling distance. Just sayin'. To the subject at hand -- I don't think TLP standing has the PR value you think it does... feature set, velocity of development, adoption, flexibility, etc -- those are far more important. -Dmitriy On Mon, Apr 5, 2010 at 11:58 AM, hc busy hc.b...@gmail.com wrote: Of course I'd love it if someday there is an ISO Pig Latin committee (with meetings in cool exotic places) deciding the official standard for Pig Latin. haha!!! Some exotic place like Yahoo's HQ in sunny Sunnyvale, California? I guess it feels like it depends on the roadmap more than the roadmap depends on it. In terms of positioning, a TLP would appear to potential users who are evaluating alternatives to consider it as _the_ choice as opposed to one of the choices. If the ambition is to take it there, then TLP, as useless as it may seem right now, might actually be worth the effort to attain. I mean, would you rather wait until Hive makes TLP and then play catch-up? I mean, I can kinda see them doing that... On Mon, Apr 5, 2010 at 11:36 AM, Alan Gates ga...@yahoo-inc.com wrote: Prognostication is a difficult business.
Of course I'd love it if someday there is an ISO Pig Latin committee (with meetings in cool exotic places) deciding the official standard for Pig Latin. But that seems like saying in your start up's business plan, When we reach Google's size, then we'll do x. If there ever is an ISO Pig Latin standard it will be years off. As others have noted, staying tight to Hadoop now has many advantages, both in technical and adoption terms. Hence my advocacy of keeping Pig Latin Hadoop agnostic while tightly integrating the backend. Which is to say that in my view, Pig is Hadoop specific now, but there may come a day when that is no longer true. Whether Pig will ever move past just running on Hadoop to running in other parallel systems won't be known for years to come. Given that, do you think it makes sense to say that Pig stays a subproject for now, but if it someday grows beyond Hadoop only it becomes a TLP? I could agree to that stance. Alan. On Apr 3, 2010, at 12:43 PM, Santhosh Srinivasan wrote: I see this as a multi-part question. Looking back at some of the significant roadmap/existential questions asked in the last 12 months, I see the following: 1. With the introduction of SQL, what is the philosophy of Pig (I sent an email about this approximately 9 months ago) 2. What is the approach to support backward compatibility in Pig (Alan had sent an email about this 3 months ago) 3. Should Pig be a TLP (the current email thread). Here is my take on answering the aforementioned questions. The initial philosophy of Pig was to be backend agnostic. It was designed as a data flow language. Whenever a new language is designed, the syntax and semantics of the language have to be laid out. The syntax is usually captured in the form of a BNF grammar. The semantics are defined by the language creators. Backward compatibility is then a question of holding true to the syntax and semantics. 
With Pig, in addition to the language, the Java APIs were exposed to customers to implement UDFs (load/store/filter/grouping/row transformation etc), provision looping since the language does not support looping constructs and also support a programmatic mode of access. Backward compatibility in this context is to support API versioning. Do we still intend to position as a data flow language that is backend agnostic? If the answer is yes, then there is a strong case for making Pig a TLP. Are we influenced by Hadoop? A big YES! The reason Pig chose to become a Hadoop sub-project was to ride the Hadoop popularity wave. As a consequence, we chose to be heavily influenced by the Hadoop roadmap. Like a good lawyer, I also have rebuttals to Alan's questions :) 1. Search engine popularity - We can discuss this with the Hadoop team and still retain links to TLP's
What should FLATTEN do?
Guys, I have a row containing a map 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks!
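To make the two candidate behaviors concrete, here is an illustrative Python model (not Pig itself; the helper names are made up) contrasting one-level unnesting, which the prior behavior matched, with the full unnesting observed on CDH2:

```python
def flatten_one_level(prefix, bag):
    """One-level FLATTEN: each tuple in the bag contributes its fields
    to a new row; a nested tuple stays intact as a single field."""
    return [prefix + t for t in bag]

def flatten_all(prefix, bag):
    """What the CDH2 build appeared to do: also unnest the inner tuples."""
    return [prefix + tuple(x for t in row for x in t) for row in bag]

# a bag of tuples, each holding one nested tuple: {((1,2)), ((2,3)), ((4,5))}
bag = [((1, 2),), ((2, 3),), ((4, 5),)]
print(flatten_one_level(("id", "data"), bag))  # rows end in a nested (x, y) field
print(flatten_all(("id", "data"), bag))        # rows end in two scalar fields
```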
Re: What should FLATTEN do?
doh, s/map/bag/g. I seem to get maps and bags mixed up for some reason... Guys, I have a row containing a *bag* 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote: Guys, I have a row containing a map 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks!
Re: What should FLATTEN do?
Yeah, I'm sure it has nested tuples. Pig doesn't natively support introduction of tuples h = foreach g generate ((x,y,z)), (x), x doesn't work, but i have a udf that does that don't ask why, and I've seen it print double pair of paren's when I took a dump. Our hadoop guys here says it's CDH2 and that the upgrade was just re-installation of CDH2... (same jars) But certainly my script suddenly started doing weird things when it flattened that all the way through. I'd support the prior behavior as well, because that seems to match my reading of documentation on behavior of FLATTEN. Has anybody else had this problem with recent cloudera/pig versions? thnx!! On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.comwrote: Stupid question but are you sure your bag has the dual sets of parentheses? (And if I may ask, why is that the case?) On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com wrote: If I'm not mistaken, the output is the expected behavior. Flatten should unnest bags. I'm assuming your statement is something like FOREACH ... GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first two fields of a tuple for every tuple in the nested bag. On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote: doh s/map/bag/g I seem to get maps and bags mixed up or some reason... Guys, I have a row containing a *bag* 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! 
On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote: Guys, I have a row containing a map 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! -- Zaki Rahaman -- Zaki Rahaman
Re: What should FLATTEN do?
yeah, you have to implement the outputSchema() method on the udf in order to make the content of the tuple visible... There's a nice example in the UDF Manual http://hadoop.apache.org/pig/docs/r0.6.0/udf.html (search for 'package myudf' until you find it). On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney russell.jur...@gmail.com wrote: Not sure if this is exactly the same, but when I've created tuples within tuples in UDFs (to preserve order of pairs), from bag input, Pig has allowed it - but I can't work with that data in subsequent steps. On Fri, Apr 2, 2010 at 12:37 PM, hc busy hc.b...@gmail.com wrote: Yeah, I'm sure it has nested tuples. Pig doesn't natively support introduction of tuples h = foreach g generate ((x,y,z)), (x), x doesn't work, but I have a udf that does that (don't ask why), and I've seen it print a double pair of parens when I took a dump. Our hadoop guys here say it's CDH2 and that the upgrade was just a re-installation of CDH2... (same jars) But certainly my script suddenly started doing weird things when it flattened that all the way through. I'd support the prior behavior as well, because that seems to match my reading of the documentation on the behavior of FLATTEN. Has anybody else had this problem with recent cloudera/pig versions? thnx!! On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com wrote: Stupid question, but are you sure your bag has the dual sets of parentheses? (And if I may ask, why is that the case?) On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com wrote: If I'm not mistaken, the output is the expected behavior. Flatten should unnest bags. I'm assuming your statement is something like FOREACH ... GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first two fields of a tuple for every tuple in the nested bag. On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote: doh, s/map/bag/g. I seem to get maps and bags mixed up for some reason...
Guys, I have a row containing a *bag* 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote: Guys, I have a row containing a map 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! -- Zaki Rahaman -- Zaki Rahaman
Re: What should FLATTEN do?
Okay guys, some details after some digging. We've got this version of pig from CDH2 installed: hadoop-pig-0.5.0+11.1-1. The list of patches that they applied on top of 0.5.0 is here: http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt The patches listed there don't seem to deal with FLATTEN in any way. Any suggestions? On Fri, Apr 2, 2010 at 1:49 PM, hc busy hc.b...@gmail.com wrote: yeah, you have to implement the outputSchema() method on the udf in order to make the content of the tuple visible... There's a nice example in the UDF Manual http://hadoop.apache.org/pig/docs/r0.6.0/udf.html (search for 'package myudf' until you find it). On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney russell.jur...@gmail.com wrote: Not sure if this is exactly the same, but when I've created tuples within tuples in UDFs (to preserve order of pairs), from bag input, Pig has allowed it - but I can't work with that data in subsequent steps. On Fri, Apr 2, 2010 at 12:37 PM, hc busy hc.b...@gmail.com wrote: Yeah, I'm sure it has nested tuples. Pig doesn't natively support introduction of tuples h = foreach g generate ((x,y,z)), (x), x doesn't work, but I have a udf that does that (don't ask why), and I've seen it print a double pair of parens when I took a dump. Our hadoop guys here say it's CDH2 and that the upgrade was just a re-installation of CDH2... (same jars) But certainly my script suddenly started doing weird things when it flattened that all the way through. I'd support the prior behavior as well, because that seems to match my reading of the documentation on the behavior of FLATTEN. Has anybody else had this problem with recent cloudera/pig versions? thnx!! On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com wrote: Stupid question, but are you sure your bag has the dual sets of parentheses? (And if I may ask, why is that the case?)
On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com wrote: If I'm not mistaken, the output is the expected behavior. Flatten should unnest bags. I'm assuming your statement is something like FOREACH ... GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first two fields of a tuple for every tuple in the nested bag. On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote: doh s/map/bag/g I seem to get maps and bags mixed up or some reason... Guys, I have a row containing a *bag* 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote: Guys, I have a row containing a map 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! -- Zaki Rahaman -- Zaki Rahaman
Re: What should FLATTEN do?
The hadoop version: hadoop-0.20-0.20.1+169.68-1 On Fri, Apr 2, 2010 at 2:33 PM, hc busy hc.b...@gmail.com wrote: Okay guys some details after some digging. We've got this version of pig from CDH2 installed: hadoop-pig-0.5.0+11.1-1 the list of patches that they applied on top of 0.5.0 are listed here: http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txtThe patches listed there doesn't seem to deal with FLATTEN in any way. Any suggestions? On Fri, Apr 2, 2010 at 1:49 PM, hc busy hc.b...@gmail.com wrote: yeah, you have to implement outputSchema() method on the udf in order to make the content of the tuple visible... There's a nice example in the UDF Manual http://hadoop.apache.org/pig/docs/r0.6.0/udf.html http://hadoop.apache.org/pig/docs/r0.6.0/udf.htmlsearch for 'package myudf' until u find it. On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney russell.jur...@gmail.com wrote: Not sure if this is exactly the same, but when I've created tuples within tuples in UDFs (to preserve order of pairs), from bag input, Pig has allowed it - but I can't work with that data in subsequent steps. On Fri, Apr 2, 2010 at 12:37 PM, hc busy hc.b...@gmail.com wrote: Yeah, I'm sure it has nested tuples. Pig doesn't natively support introduction of tuples h = foreach g generate ((x,y,z)), (x), x doesn't work, but i have a udf that does that don't ask why, and I've seen it print double pair of paren's when I took a dump. Our hadoop guys here says it's CDH2 and that the upgrade was just re-installation of CDH2... (same jars) But certainly my script suddenly started doing weird things when it flattened that all the way through. I'd support the prior behavior as well, because that seems to match my reading of documentation on behavior of FLATTEN. Has anybody else had this problem with recent cloudera/pig versions? thnx!! 
On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman zaki.raha...@gmail.com wrote: Stupid question but are you sure your bag has the dual sets of parentheses? (And if I may ask, why is that the case?) On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman zaki.raha...@gmail.com wrote: If I'm not mistaken, the output is the expected behavior. Flatten should unnest bags. I'm assuming your statement is something like FOREACH ... GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first two fields of a tuple for every tuple in the nested bag. On Fri, Apr 2, 2010 at 2:02 PM, hc busy hc.b...@gmail.com wrote: doh s/map/bag/g I seem to get maps and bags mixed up or some reason... Guys, I have a row containing a *bag* 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! On Fri, Apr 2, 2010 at 11:29 AM, hc busy hc.b...@gmail.com wrote: Guys, I have a row containing a map 'id','data', {((1,2)), ((2,3)), ((4,5))} What is the expected behavior when I flatten on that bag? I had expected it to result in 'id','data', (1,2) 'id','data', (2,3) 'id','data', (4,5) But it appears to me that the result of applying FLATTEN to that bag is this instead: 'id','data', 1,2 'id','data', 2,3 'id','data', 4,5 The latter is returned by the current cloudera's CDH2 and I've seen the prior behavior on other versions of pig. Which is the correct behavior by design? What will pig 0.6 do when it is released? thanks! -- Zaki Rahaman -- Zaki Rahaman
download link broken?
the link at http://hadoop.apache.org/pig/releases.html Download a release now! http://www.apache.org/dyn/closer.cgi/hadoop/pig links to a non-existent release. thnx
Operating on Cogroups and Iterations in Pig Re: more bagging fun
Hmm, okay, I read the documentation further and it appears that this has already been discussed previously (here: http://wiki.apache.org/pig/PigTypesFunctionalSpec). There seems to be a question of what's the right thing to do. It seems clear to me though. When an operation like '*' is applied, this is clearly an item-wise operation that is to be applied to each member of the bag. If a function is aggregate (SUM), then it operates across an entire bag. When a COGROUP occurs, just do what SQL does. Which is to say, perform a cross join if an aggregate has been applied across several bags. And do so automatically, so we don't have to type out the separate FLATTENs: grouped = COGROUP employee BY name, bonuses BY name; flattened = FOREACH grouped GENERATE group, FLATTEN(employee), FLATTEN(bonuses); grouped_again = GROUP flattened BY group; total_compensation = FOREACH grouped_again GENERATE group, SUM(employee::salary * bonuses::multiplier); So this should do the same: grouped = COGROUP employee BY name, bonuses BY name; total_compensation = FOREACH grouped GENERATE group, SUM(employee::salary * bonuses::multiplier); automatically, because that can only have one meaning. Alternatively, if it is desired to stay with a low-level language, the solution to all of this confusion around UDF's that take bags and UDF's that operate on members of bags can be resolved if we do two things. 1.) Allow UDF's to actually become first-class citizens. This way we can pass UDF's to other UDF's. 2.) Introduce the concept of map() and reduce() operators over bags. These two things allow us more freedom and follow the paradigm of map-reduce more closely. grouped = COGROUP employee BY name, bonuses BY name; total_compensation = FOREACH grouped GENERATE group, reduce(SUM, map(*, employee::salary, bonuses::multiplier)); Actually, this may deserve a separate keyword.
Because map and reduce operate on single bags, whereas Pig introduces this concept of co-grouping, we should have comap and coreduce that take functions and operate on the multiple bags that result from a COGROUP:

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group, REDUCE(SUM, COMAP(*, employee::salary, bonuses::multiplier));

This allows us to write efficiently, on one line, what would otherwise be several aliases and unnecessary FLATTENed cross products. A second thing I see is the recommendation to implement looping constructs. May I suggest, as a follow-up to the above, that we beef up UDFs as first-class citizens and add the ability to create UDF functions in Pig Latin with the ability to recurse. The reason I think this is a better way to loop than for(;;), while(){}, and do{}while() statements is that recursive calls are functional and more easily optimizable than imperative programming. The Pig Journal (http://wiki.apache.org/pig/PigJournal) has an entry for all of these constructs and functions under the heading "Extending Pig to Include Branching, Looping, and Functions", but because the map-reduce paradigm is inherently functional, I would rather think that staying functional is a better way to approach this improvement. So the minimal set of additional features needed is functions and branching, and we would get loops as a side effect of those improvements. For the optimizations to be available to the Pig Latin interpreter, the functions and branching *must* be implemented within the Pig system. If they are externalized, or implemented as a UDF of some other language, the opportunities for optimizing execution vanish. Anyway, a couple of cents on a rainy day. On Wed, Mar 10, 2010 at 10:15 AM, hc busy hc.b...@gmail.com wrote: An additional thought...
We can define UDFs like ADD(bag{(int,int)}), DIVIDE(bag{(int,int)}), MULTIPLY(bag{(int,int)}), SQRT(bag{(float)})... basically vectorize most of the common arithmetic operations, but then the language has to support it by converting bag.a + bag.b into ADD(bag.(a,b)). I guess there are some difficulties. For instance: SQRT(bag.a) + bag.b. How would this work? Because SQRT(bag.a) returns a bag, how would we convert it to the correct per-tuple operation? It's almost as if we want to convert an expression SUM(SQRT(bag.a), bag.b) into a function F such that SUM(SQRT(bag.a), bag.b) = F(bag.a, bag.b), and then F is computed by iterating through each tuple of the bag: FOREACH ... GENERATE ..., F(bag.(a,b)); On Wed, Mar 10, 2010 at 9:31 AM, hc busy hc.b...@gmail.com wrote: So, pig team, what is the right way to accomplish this? On Tue, Mar 9, 2010 at 10:50 PM, Mridul Muralidharan mrid...@yahoo-inc.com wrote: On Tuesday 09 March 2010 04:13 AM, hc busy wrote: okay. Here's the bag that I have: {group: (a: int, b: chararray, c: chararray, d: int), TABLE: {number1: int, number2: int}} and I want to do this: grunt> CALCULATE = FOREACH
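To make the proposed REDUCE/COMAP semantics concrete, here is a small Python model (an illustration only, not Pig code; bags are modeled as plain lists, and the hypothetical data is invented). It follows the cross-join interpretation described earlier in the thread, where an aggregate applied across several co-grouped bags operates over all pairs, as SQL's SUM over a cross join would:

```python
import itertools
import operator
from functools import reduce

def comap(op, *bags):
    """Apply op to every combination of members across co-grouped bags,
    i.e. the cross-join interpretation of a per-item operation."""
    return [op(*vals) for vals in itertools.product(*bags)]

def bag_reduce(agg, bag):
    """Fold an aggregate over a bag; SUM is just repeated addition."""
    return reduce(agg, bag)

# One co-group's bags (hypothetical values).
salaries = [100, 200]      # employee::salary
multipliers = [1, 2]       # bonuses::multiplier

# Models: REDUCE(SUM, COMAP(*, employee::salary, bonuses::multiplier))
total = bag_reduce(operator.add, comap(operator.mul, salaries, multipliers))
print(total)  # 100*1 + 100*2 + 200*1 + 200*2 = 900
```

This is the same value the explicit FLATTEN-then-GROUP formulation would produce, which is the point of the one-line syntax being proposed.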
Re: ERROR 6017: Execution failed, while processing
Okay, just a quick update: I eventually found the actual Java error in the Hadoop logs, but it was equally confusing. It complains about accessing the 4th element of a tuple that has only one item. But still, it doesn't say which line of Pig Latin introduced that error. I commented out portions of my large Pig script until I found the offending line... I wish there were an easier way to debug this... On Mon, Mar 8, 2010 at 5:25 PM, hc busy hc.b...@gmail.com wrote: Guys, I just ran into a weird exception 500 lines into writing a pig script... Attached below is the error. Does anybody have any idea how to debug this? I don't even know which step of my 500-line pig script caused this error. Any suggestions on how to track down the offending operation? Thanks in advance!

Pig Stack Trace
---------------
ERROR 6017: Execution failed, while processing hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290, hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033, hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265, hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900, hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299, hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534, hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384, hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628, hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358, hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091

org.apache.pig.backend.executionengine.ExecException: ERROR 6017: Execution failed, while processing hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290, hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033, hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265, hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900, hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299, hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534, hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384, hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628, hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358, hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:181)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:777)
    at org.apache.pig.PigServer.execute(PigServer.java:770)
    at org.apache.pig.PigServer.access$100(PigServer.java:89)
    at org.apache.pig.PigServer$Graph.execute(PigServer.java:947)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:320)
ERROR 6017: Execution failed, while processing
Guys, I just ran into a weird exception 500 lines into writing a pig script... Attached below is the error. Does anybody have any idea how to debug this? I don't even know which step of my 500-line pig script caused this error. Any suggestions on how to track down the offending operation? Thanks in advance!

Pig Stack Trace
---------------
ERROR 6017: Execution failed, while processing hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290, hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033, hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265, hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900, hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299, hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534, hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384, hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628, hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358, hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091

org.apache.pig.backend.executionengine.ExecException: ERROR 6017: Execution failed, while processing hdfs://tasktracker:5/tmp/temp1581022765/tmp939224290, hdfs://tasktracker:5/tmp/temp1581022765/tmp-1028111033, hdfs://tasktracker:5/tmp/temp1581022765/tmp-198156265, hdfs://tasktracker:5/tmp/temp1581022765/tmp-72050900, hdfs://tasktracker:5/tmp/temp1581022765/tmp-141993299, hdfs://tasktracker:5/tmp/temp1581022765/tmp2135611534, hdfs://tasktracker:5/tmp/temp1581022765/tmp-2093411384, hdfs://tasktracker:5/tmp/temp1581022765/tmp250626628, hdfs://tasktracker:5/tmp/temp1581022765/tmp2100381358, hdfs://tasktracker:5/tmp/temp1581022765/tmp167762091
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:181)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:777)
    at org.apache.pig.PigServer.execute(PigServer.java:770)
    at org.apache.pig.PigServer.access$100(PigServer.java:89)
    at org.apache.pig.PigServer$Graph.execute(PigServer.java:947)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
    at org.apache.pig.Main.main(Main.java:320)
native types as value of map type? Re: Complex data types as value in a map function
Well... I have this data:

[key#'1', b#'2', c#'3', key2#5]
[key#'2', b#'i', c#'m', key2#6]
[key#'3', b#'j', c#'n', key2#7]
[key#'4', b#'k', c#'o', key2#8]

and I run:

A = load 'simple_map.data' as (m:map[]);
A2 = FOREACH A generate (int)(m#'key2') as key, m;

dump A2 returns:

(,[key2#5, b#'2', key#'1', c#'3'])
(,[key2#6, b#'i', key#'2', c#'m'])
(,[key2#7, b#'j', key#'3', c#'n'])
(,[key2#8, b#'k', key#'4', c#'o'])

I'm looking at PIG-613, but I guess the title is misleading: none of the casting of map values works in 0.5.0. I guess if PIG-613 worked as described I would be in okay shape, because I would be able to cast again and again using separate aliases... PIG-613 is not what I meant by PIG-1016, but it seems to get me the feature I want. On Tue, Jan 5, 2010 at 7:00 PM, Guy Bayes fatal.er...@gmail.com wrote: thanks Thejas, that thread helped out immensely. Also great to see Santhosh remembered that nasty PIG-880 bug with the type inference causing an integer overflow, which coincidentally enough I also got stung by at one time. In the meantime, while I would love to have complex map datatypes, this can certainly be worked around using other methods. Appreciate the prompt response. Guy On Tue, Jan 5, 2010 at 10:38 AM, Thejas Nair te...@yahoo-inc.com wrote: This issue in PigStorage is present in recent versions of Pig, i.e. you cannot have complex types (bag, tuple, map) as a value of a map type if you are using PigStorage. See https://issues.apache.org/jira/browse/PIG-1016 -Thejas On 1/5/10 10:28 AM, Alan Gates ga...@yahoo-inc.com wrote: It should be supported. You may need to explicitly cast it to a tuple so Pig knows to treat it as a tuple. Can you send the scripts that are giving the error? Alan. On Jan 4, 2010, at 9:10 PM, Guy Bayes wrote: Is this supported?
Say I have a map [f2#(1,6)]. I cannot figure out how to de-reference the (1,6) tuple; I either get a type conversion failure with () returned, or a 1066 error message: ERROR 1066: Unable to open iterator for alias. thanks Guy -- you may be acquainted with the night but i have seen the darkness in the day and you must know it is a terrifying sight...
[jira] Resolved: (PIG-1082) Modify Comparator to work with a typed textual Storage
[ https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy resolved PIG-1082. -- Resolution: Fixed Modify Comparator to work with a typed textual Storage -- Key: PIG-1082 URL: https://issues.apache.org/jira/browse/PIG-1082 Project: Pig Issue Type: Sub-task Affects Versions: 0.4.0 Reporter: hc busy Original Estimate: 5h Remaining Estimate: 5h See parent bug. This ticket is for just the comparator change, which needs to be made in order for the nested data structures to sort right -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage
[ https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1082: - Attachment: (was: PIG-1082.patch)
[jira] Updated: (PIG-1083) Build separate Storage to read in hiearchical data
[ https://issues.apache.org/jira/browse/PIG-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1083: - Summary: Build separate Storage to read in hiearchical data (was: Build separator Storage to read in hiearchical data) Build separate Storage to read in hiearchical data -- Key: PIG-1083 URL: https://issues.apache.org/jira/browse/PIG-1083 Project: Pig Issue Type: Sub-task Reporter: hc busy Original Estimate: 5h Remaining Estimate: 5h See parent ticket -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796893#action_12796893 ] hc busy commented on PIG-1016: -- Hi Thejas, Olga, and the rest: that sounds about right. I think PIG-1082 is ready from my previous effort, and PIG-1083 still needs to be done. And perhaps it will make more sense to use Avro or some other binary format instead. I still have an ASCII nested data structure to read in, but it's not very high-performance. Not sure if anybody needs it any more. Reading in map data seems broken Key: PIG-1016 URL: https://issues.apache.org/jira/browse/PIG-1016 Project: Pig Issue Type: Improvement Components: data Affects Versions: 0.4.0 Reporter: hc busy Fix For: 0.7.0 Attachments: PIG-1016.patch Hi, I'm trying to load a map that has a tuple for value. The read fails in 0.4.0 because of a misconfiguration in the parser, whereas almost all documentation states that the value of a map can be any type. I've attached a patch that allows us to read in complex objects as values, as documented. I've done simple verification of loading maps with tuple/map values and writing them back out using LOAD and STORE. All seems to work fine. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Status: Open (was: Patch Available)
[jira] Created: (PIG-1082) Modify Comparator to work with a typed textual Storage
Modify Comparator to work with a typed textual Storage -- Key: PIG-1082 URL: https://issues.apache.org/jira/browse/PIG-1082 Project: Pig Issue Type: Sub-task Affects Versions: 0.4.0 Reporter: hc busy Fix For: 0.0.0 See parent bug. This ticket is for just the comparator change, which needs to be made in order for the nested data structures to sort right -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1083) Build separator Storage to read in hiearchical data
Build separator Storage to read in hiearchical data --- Key: PIG-1083 URL: https://issues.apache.org/jira/browse/PIG-1083 Project: Pig Issue Type: Sub-task Reporter: hc busy See parent ticket -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage
[ https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1082: - Attachment: PIG-1082.patch (changes only the comparator)
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771308#action_12771308 ] hc busy commented on PIG-1016: -- Well, I'd like to start by thanking everyone for the attention and support! As a first-time contributor, my heart is warmed by the encouraging comments and the serious time everyone is spending on my problem. I also greatly appreciate everybody's patience, and of course I am perpetually grateful for everybody's work in making this all work. Line by line:
{code}
+// FindBugs is complaining about nulls. This check sequence will prevent nulls from being dereferenced.
+if (o1 != null && o2 != null) {
...
+} else {
+  if (o1 == null && o2 == null) { rc = 0; }
+  else if (o1 == null) { rc = -1; }
+  else { rc = 1; }
{code}
Does what it says: it prevents a FindBugs warning. Non-null is greater than null by convention.
{code}
+// In case the objects are comparable
+if ((o1 instanceof NullableBytesWritable && o2 instanceof NullableBytesWritable) ||
+    !(o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable)) {
+
+  NullableBytesWritable nbw1 = (NullableBytesWritable) o1;
+  NullableBytesWritable nbw2 = (NullableBytesWritable) o2;
+
+  // If either is null, handle differently.
+  if (!nbw1.isNull() && !nbw2.isNull()) {
+    rc = ((DataByteArray) nbw1.getValueAsPigType()).compareTo((DataByteArray) nbw2.getValueAsPigType());
+  } else {
+    // For sorting purposes two nulls are equal.
+    if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+    else if (nbw1.isNull()) rc = -1;
+    else rc = 1;
+  }
+}
{code}
The if statement takes us outside of the original comparison code (enclosed in the outer if above) ONLY if both comparees are PigNullableWritables that are not NullableBytesWritables. This may seem confusing at first glance, but it does exactly what the code did before the patch, except for the new case that I introduced by allowing other types. The code is awkward, as Santhosh noted, and I am not too sure I understand the original implementation. But this way we certainly preserve the original behavior, and the new cases this patch introduces are handled in the remaining else:
{code}
+} else {
+  // We enter here only if both o1 and o2 are PigNullableWritables that are not NullableBytesWritables.
+  PigNullableWritable nbw1 = (PigNullableWritable) o1;
+  PigNullableWritable nbw2 = (PigNullableWritable) o2;
+  // If either is null, handle differently.
+  if (!nbw1.isNull() && !nbw2.isNull()) {
+    rc = nbw1.compareTo(nbw2);
+  } else {
+    // For sorting purposes two nulls are equal.
+    if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+    else if (nbw1.isNull()) rc = -1;
+    else rc = 1;
+  }
+}
{code}
This is the safest way I can think of writing this code, and with it I have been able to ORDER BY a value taken out of a map; a join followed by a sort keyed on map values also works. I guess something that flows better might be the following:
{code}
if (o1 != null && o2 != null) {
  if (o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable) {
    PigNullableWritable nbw1 = (PigNullableWritable) o1;
    PigNullableWritable nbw2 = (PigNullableWritable) o2;
    // If either is null, handle differently.
    if (!nbw1.isNull() && !nbw2.isNull()) {
      rc = nbw1.compareTo(nbw2);
    } else {
      // For sorting purposes two nulls are equal.
      if (nbw1.isNull() && nbw2.isNull()) rc = 0;
      else if (nbw1.isNull()) rc = -1;
      else rc = 1;
    }
  } else {
    throw new Exception("bad compare");
  }
} else {
  if (o1 == null && o2 == null) { rc = 0; }
  else if (o1 == null) { rc = -1; }
  else { rc = 1; }
{code}
But I must admit that I don't know what the right thing to do is. I don't know the design well enough to know whether throwing an exception is appropriate, or something else. And would the last code block perform the right comparison in place of the original function? Let me know your thoughts on improvements to the patch.
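The null-ordering convention the comparator above keeps repeating (two nulls compare equal, and a null sorts before any non-null value) can be sketched outside of Pig. This is an illustration of the convention only, not Pig's actual comparator code:

```python
def null_safe_compare(o1, o2, cmp=lambda a, b: (a > b) - (a < b)):
    """Compare two possibly-None values with the convention used in the
    patch: None == None for sorting purposes, and None is less than
    (sorts before) any non-None value."""
    if o1 is not None and o2 is not None:
        return cmp(o1, o2)          # both present: ordinary comparison
    if o1 is None and o2 is None:
        return 0                    # two nulls are equal
    return -1 if o1 is None else 1  # null sorts before non-null

print(null_safe_compare(None, None))  # 0
print(null_safe_compare(None, 5))     # -1
print(null_safe_compare(7, 3))        # 1
```

Checking every branch like this is exactly what the nested if/else chains in the Java patch are doing, once for raw nulls and once for the isNull() flag on the writable wrappers.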
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771571#action_12771571 ] hc busy commented on PIG-1016: -- Thejas, great point! Run-time detection of type does use more time at run time and requires more discipline to use. But I'd like to point out that the original implementation seems to have allowed for this in PigStorage; the change reducing the types that can be stored as the value of a map seems to reduce Pig's functionality. I guess the one case where I want to use a map is when I have a sparse tuple and don't want to type in a type for each of the many fields, because if I went to that trouble, I'd just write Java code, or use something where the schema is statically defined and stored. Say, for a simple example, a self join of this one row: {{\[data1#\[score#15l,unique_id#100\],data2#\[score#15,foreign#00100\]\]}}
{code}
B = join A by m#data1#unique_id, A by m#data2#foreign;
C = filter B by $0#score == $1#score;
{code}
I'd think something like this should work without me typing in the entire type structure. Also, what happens when BinStorage returns a map with a value that isn't a bytearray? Does the comparison fail?
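The join-on-map-values semantics being asked for can be modeled in plain Python over dicts (hypothetical data; an illustration of the intended semantics, not Pig's implementation):

```python
# Each row holds a sparse record as nested dicts, like the map example above.
rows = [{"data1": {"score": 15, "unique_id": 100},
         "data2": {"score": 15, "foreign": 100}}]

# Models: B = JOIN A BY m#data1#unique_id, A BY m#data2#foreign;
joined = [(r1, r2) for r1 in rows for r2 in rows
          if r1["data1"]["unique_id"] == r2["data2"]["foreign"]]

# Models: C = FILTER B BY $0#score == $1#score;
filtered = [(r1, r2) for r1, r2 in joined
            if r1["data1"]["score"] == r2["data2"]["score"]]

print(len(filtered))  # 1: the row self-joins and passes the score filter
```

No per-field schema is declared anywhere here, which is the appeal of the sparse-tuple-as-map approach in the comment above.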
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Fix Version/s: (was: 0.4.0) 0.5.0 - Status: Patch Available (was: Open)
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771170#action_12771170 ] hc busy commented on PIG-1016: -- Okay, trying to get this into a release of Pig... I noticed 0.4 came out, but nothing has happened on this ticket.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: PIG-1016.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch Same patch as before, but the hash seems different; maybe I submitted the wrong patch previously. d337d3264bf5e6e925515ceff90718e10
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12767387#action_12767387 ] hc busy commented on PIG-1016: -- %...@#$, had me sweating for a while... As mentioned previously, this is functionality that I'd like to use, not just a fun weekend project... hehe.. thnx.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: PIG-1016.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch Re-attaching the patch. It seems my previous patch didn't pass _any_ unit tests. Ouch! Anyway, I ran a few unit tests and they still pass on my machine. I've been accused of having crap on my machine that makes programs pass their unit tests. Hopefully those accusations were false, and when the unit tests pass on my machine, they pass on the build machines too. 4b425...904b2
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766710#action_12766710 ] hc busy commented on PIG-1016: -- 'kay, since my last comment I've verified that, in trunk, the patch in this ticket did not introduce an error: the skewed join (correct or not) returns the same number of rows when the data read in comes from a nested data structure as when it comes from a tuple.
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766202#action_12766202 ] hc busy commented on PIG-1016: -- I skimmed PIG-880. Here is a simplified version of what I might need to do:

bash% cat map.dat
[a#2,b#'d',c#(1,2,3)]
[a#1,b#'a',c#(1,2,3)]
[a#3,b#'c',c#(1,2,3)]
bash% pig
grunt> A = load 'map.dat' as (data:map[]);
grunt> B = foreach A generate (int)(data#'a'), (chararray)(data#'b'), (tuple())(data#'c');
grunt> C = order B by $0;
grunt> dump C;
(1,'a',(1,2,3))
(2,'d',(1,2,3))
(3,'c',(1,2,3))
grunt> D = order B by $1;
grunt> dump D;
(1,'a',(1,2,3))
(3,'c',(1,2,3))
(2,'d',(1,2,3))
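For readers without a cluster handy, the two ORDER BY results in the grunt session above can be mimicked in plain Python. This is only an illustration of the expected ordering; the row values are taken from the map.dat example and the variable names are made up here:

```python
# Rows as they would come out of the generate step in the session above:
# (int cast of data#'a', chararray data#'b', tuple data#'c').
rows = [
    (2, "'d'", (1, 2, 3)),
    (1, "'a'", (1, 2, 3)),
    (3, "'c'", (1, 2, 3)),
]

# C = order B by $0;  -- sort on the int field
by_field0 = sorted(rows, key=lambda r: r[0])

# D = order B by $1;  -- sort on the chararray field
by_field1 = sorted(rows, key=lambda r: r[1])
```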
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: PIG-1016.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch Submitting a patch to work around both PIG-880 and PIG-1016.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: PIG-1016.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Status: Open (was: Patch Available) Didn't pass a few other affected unit tests.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Status: Patch Available (was: Open)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch Sorry, first-time contributor. This submission includes the fix and fixes several unit tests that failed.
[jira] Created: (PIG-1016) Reading in map data seems broken
Reading in map data seems broken Key: PIG-1016 URL: https://issues.apache.org/jira/browse/PIG-1016 Project: Pig Issue Type: Improvement Components: data Affects Versions: 0.4.0 Reporter: hc busy
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Status: Patch Available (was: Open)

% diff org/apache/pig/data/parser/TextDataParser.jjt org/apache/pig/data/parser/newTextDataParser.jjt
145c145
< String value = null;
---
> Object value = null;
149c149
< (key = StringDatum() # value = StringDatum())
---
> (key = StringDatum() # value = Datum())
151c151
< keyValues.put(key, new DataByteArray(value.getBytes("UTF-8")));
---
> keyValues.put(key, value);
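A rough illustration of what the grammar change above accomplishes, not Pig's actual parser: before the patch every map value went through StringDatum(), so complex values could not be read; after it, values go through the general Datum() production, so tuples and maps can nest as values. The following Python sketch (all function names hypothetical) mimics that dispatch for the text format used in map.dat:

```python
# Hypothetical sketch of the parser behavior after the patch: a map value
# may be a tuple '(...)', a nested map '[...]', or a plain scalar string.

def parse_datum(text):
    """Parse one datum: a tuple, a map, or a scalar left as a string."""
    text = text.strip()
    if text.startswith("(") and text.endswith(")"):
        return tuple(parse_datum(f) for f in split_top_level(text[1:-1]))
    if text.startswith("[") and text.endswith("]"):
        return parse_map(text)
    return text  # scalar: kept as a string, like Pig's bytearray default

def parse_map(text):
    """Parse '[k#v,k#v]' where each value may itself be a complex datum."""
    result = {}
    for entry in split_top_level(text.strip()[1:-1]):
        key, _, value = entry.partition("#")
        # The patched grammar calls Datum() here instead of StringDatum():
        result[key] = parse_datum(value)
    return result

def split_top_level(text):
    """Split on commas that are not nested inside () or []."""
    parts, depth, cur = [], 0, []
    for ch in text:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append("".join(cur))
    return parts
```

Parsing the first line of map.dat, `[a#2,b#'d',c#(1,2,3)]`, yields a dict whose 'c' entry is a tuple rather than a flat string, which is exactly what the StringDatum()-only grammar could not produce.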
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: map_to_any_value.patch A patch for org/apache/pig/data/parser/TextDataParser.jjt
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: map_to_any_value.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: trunk_map_to_any_value.patch Including a patch via svn diff.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch rename
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: trunk_map_to_any_value.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Status: Open (was: Patch Available)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: PIG-1016.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch This patch is generated with svn diff and includes a unit test.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: (was: PIG-1016.patch)
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hc busy updated PIG-1016: - Attachment: PIG-1016.patch Unit test plus patch. This time the unit test actually passes.