it is just not supported, similar to the case I mentioned before. I would suggest you write your own UDF to do that.
Shawn On Wed, Sep 14, 2011 at 10:59 AM, Pierre-Luc Brunet <[email protected]> wrote: > Under pig 0.9.1-SNAPSHOT, I get: > > 2011-09-14 12:54:10,075 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1200: <line 7, column 12> Syntax error, unexpected symbol at or near 'y' > > Under pig 0.8.1-cdh3u1, I get: > > 2011-09-14 12:55:35,765 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1000: Error during parsing. Encountered " <IDENTIFIER> "x "" at line 4, > column 15. > > Any idea what's the problem? > > -- > Pierre > > On 2011-09-14, at 12:52 PM, Xiaomeng Wan wrote: > >> wrong button or what? not sure, anyway, try this: >> >> a = group records by id; >> b = foreach a { x = COUNT(records); y = order records by thevalue; z = >> limit y x*0.95; z1 = order records by thevalue desc; z2 = limit z1 >> x*0.9; generate group, z2; } >> >> never try this before, if no luck, you need to find other workaround. >> >> Shawn >> >> On Wed, Sep 14, 2011 at 10:26 AM, Xiaomeng Wan <[email protected]> wrote: >>> Pierre, >>> >>> Union is not allowed within foreach. Fortunately, you donot need it. I >>> just realize the code I give you doesnot generate what you want, >>> actually it generates the complement of what you want. Try something >>> like this: >>> >>> a = group records by id; >>> b = foreach a { >>> >>> >>> >>> >>> On Wed, Sep 14, 2011 at 10:09 AM, Pierre-Luc Brunet >>> <[email protected]> wrote: >>>> Shawn, >>>> >>>> This looks indeed pretty good except for one thing. >>>> >>>> I already do a GROUP on my table in order to group my records by "item". >>>> If I run your code, I end up filtering against the entire data set instead >>>> of filtering each group individually. I tried to put your code inside a >>>> foreach statement without much luck. >>>> >>>> Any idea? >>>> >>>> -- >>>> Pierre >>>> >>>> >>>> On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote: >>>> >>>>> try this: >>>>> >>>>> a = group allrecords all; >>>>> b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR >>>>> >>>>> c = order allrecords by thevalue; >>>>> d = limit c b.total*0.05; >>>>> >>>>> e = order allrecords by thevalue desc; >>>>> f = limit e b.total*0.05; >>>>> >>>>> g = union d, f; >>>>> >>>>> Shawn >>>>> >>>>> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <[email protected]> >>>>> wrote: >>>>>> Question. >>>>>> >>>>>> What would be the best way in Pig to grab a set of data, find the record >>>>>> that matches the 5th percentile, find the record that matches the 95th >>>>>> percentile and throw away what's before and after that? >>>>>> >>>>>> Obviously, my math doesn't work for this. >>>>>> >>>>>> In hope that it helps clarifying what I'm trying to do, here's how I >>>>>> currently do it in Javascript: >>>>>> >>>>>> result.bid_array.sort(function(a,b) { return a - b; }); >>>>>> var bid_p5 = Math.round(5/100 * result.bid_array.length); >>>>>> var bid_p95 = Math.round(95/100 * result.bid_array.length); >>>>>> >>>>>> result.bid_array.splice(bid_p95, result.bid_array.length - ( >>>>>> result.bid_array.length - bid_p95)); >>>>>> result.bid_array.splice(0, bid_p5); >>>>>> >>>>>> -- >>>>>> Pierre-Luc Brunet >>>>>> ZeStuff - http://www.zestuff.com >>>>>> >>>>>> Phone: (877) 5ZESTUFF >>>>>> Mobile: (514) 600-0234 >>>>>> Email: [email protected] >>>>>> >>>>>> 9320 Saint-Laurent, #502 >>>>>> Montreal, QC, Canada, H2N 1N7 >>>>>> >>>>>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote: >>>>>> >>>>>>> If you look at the data for #25 you posted below, you will find that >>>>>>> there >>>>>>> is no row such that the price is between 5 and 95%! >>>>>>> khadgar is such an extreme outlier, it moves the 5% line above everyone >>>>>>> else, and of course it itself sets the 100% line. >>>>>>> >>>>>>> D >>>>>>> >>>>>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet >>>>>>> <[email protected]>wrote: >>>>>>> >>>>>>>> That worked except that for some reason, there's a lot of data that is >>>>>>>> missing in the final output (compared to what it should return). >>>>>>>> >>>>>>>> For example, the file I load has these lines: >>>>>>>> >>>>>>>> 7 25 us darkspear a Redacted 4750 >>>>>>>> 5000 1 >>>>>>>> 8 25 us emerald-dream a Lornadoome 9500 >>>>>>>> 10000 1 >>>>>>>> 21 25 eu khadgar a Haiibanklol 769499 809999 >>>>>>>> 1 >>>>>>>> 7 25 us queldorei a Worfgt 27862 34827 >>>>>>>> 1 >>>>>>>> 3 25 us antonidas a Oldcrafter 19000 >>>>>>>> 20000 1 >>>>>>>> >>>>>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now >>>>>>>> grouped on only one column), I don't have any records with 25 as the >>>>>>>> key. >>>>>>>> The first 5 rows in my tsv files are >>>>>>>> >>>>>>>> 35 3.19973415E7 >>>>>>>> 36 122914.0 >>>>>>>> 37 50000.0 >>>>>>>> 38 416099.9 >>>>>>>> 39 901333.8571428572 >>>>>>>> 43 191496.5 >>>>>>>> 44 236454.0 >>>>>>>> >>>>>>>> >>>>>>>> I really have no idea where the missing rows went :\ >>>>>>>> >>>>>>>> -- >>>>>>>> Pierre-Luc Brunet >>>>>>>> ZeStuff - http://www.zestuff.com >>>>>>>> >>>>>>>> Phone: (877) 5ZESTUFF >>>>>>>> Mobile: (514) 600-0234 >>>>>>>> Email: [email protected] >>>>>>>> >>>>>>>> 9320 Saint-Laurent, #502 >>>>>>>> Montreal, QC, Canada, H2N 1N7 >>>>>>>> >>>>>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote: >>>>>>>> >>>>>>>>> you can change >>>>>>>>> >>>>>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95 AS >>>>>>>>> p95; >>>>>>>>> to >>>>>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction), >>>>>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95; >>>>>>>>> >>>>>>>>> then regroup after the foreach block >>>>>>>>> >>>>>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95); >>>>>>>>> p2a = group p2 by (item, region, realm, faction); >>>>>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price; >>>>>>>>> >>>>>>>>> or write you own UDF to get the average within the foreach block. It >>>>>>>>> would be ideal if we can move p2 statement into the foreach block like >>>>>>>>> this: p2 = filter autionsPrice by price >= p5 and price <= p95, but i >>>>>>>>> donot think it is supported right now. >>>>>>>>> >>>>>>>>> Shawn >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <[email protected]> >>>>>>>> wrote: >>>>>>>>>> Heya! >>>>>>>>>> >>>>>>>>>> I've been trying to do something with Pig for about 4 days now and I >>>>>>>> have nothing but failure to show for it. I was wondering if anybody >>>>>>>> could >>>>>>>> look at my queries and slap some sense into me? I've uploaded the >>>>>>>> queries to >>>>>>>> pastebin: http://pastebin.com/kzMxYwrY >>>>>>>>>> >>>>>>>>>> In short, I want to take my data, group it by 4 fields, then for each >>>>>>>> group, I want to: >>>>>>>>>> - Find out the 5th and the 95th percentile for the 'price' >>>>>>>>>> - Filter each group to remove the records that are < 5th percentile >>>>>>>>>> and >>>>>>>>> 95 percentile. >>>>>>>>>> >>>>>>>>>> Then for each group, I want to grab the AVG() of what's left. >>>>>>>>>> >>>>>>>>>> I tried many variations of the same code and always ended up with >>>>>>>>>> either >>>>>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more >>>>>>>> than >>>>>>>> one row in the output." >>>>>>>>>> >>>>>>>>>> Any help would be greatly appreciated. Thanks! :) >>>>>>>>>> -- >>>>>>>>>> Pierre-Luc Brunet >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>>> >>>> >>>> >>>> >>> > > >
