Actually, what I was looking for isn't distributed quantiles. I was looking for the share of the total that the top x% hold. E.g. in my example, it could be that the top 10% of the users hold 50% of the total money.
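To make that concrete, here is a minimal sketch in plain Python (not Pig) of the per-region computation. The function name top_share, the percent parameter, the choice to round the cutoff up, and the sample numbers are all made up for illustration; it assumes everything for a region fits in memory.

```python
from collections import defaultdict


def top_share(rows, percent=10):
    """Return {region: share of that region's money held by its top `percent`% of users}.

    rows is an iterable of (userid, money, region) tuples, mirroring the
    schema of the 'users' relation in the original question.
    """
    # Group the money values by region (the equivalent of "group users by region").
    by_region = defaultdict(list)
    for _userid, money, region in rows:
        by_region[region].append(money)

    shares = {}
    for region, amounts in by_region.items():
        # Sort descending (the equivalent of "order users by money desc").
        amounts.sort(reverse=True)
        # Assumption: round the cutoff up with integer arithmetic, so at least
        # one user is always counted and no floating-point rounding is involved.
        k = max(1, (len(amounts) * percent + 99) // 100)
        total = sum(amounts)
        shares[region] = sum(amounts[:k]) / total if total else 0.0
    return shares


# Made-up sample: ten users in one region, total money 100.
users = [("u%d" % i, m, "eu") for i, m in enumerate([50, 10, 10, 5, 5, 5, 5, 4, 3, 3])]
print(top_share(users))  # {'eu': 0.5}
```

The same three steps (sort descending, take the first k, divide by the region total) are what a UDF applied to each region's bag would have to do.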
So it looks like I'll need to come up with a UDF which delivers this.

Cheers,
-Marco

On 19 Mar 2013 00:23, "Mike Sukmanowsky" <[email protected]> wrote:

> Distributed quantiles aren't an easy problem to solve (as you can see from
> LinkedIn's source) but perhaps in time they'll be brought into core
> functions. It wasn't until 0.11.0 that date/time functions were brought
> into built-in. Had to use a combination of Piggybank and custom UDFs.
>
> On Mon, Mar 18, 2013 at 5:13 PM, Marco Cadetg <[email protected]> wrote:
>
> > Thanks a lot Mike. This seems to be what I'm looking for ;)
> >
> > I'm a bit disappointed that what I wanted to achieve isn't possible
> > without using any UDF.
> >
> > Cheers,
> > -Marco
> >
> > On Mon, Mar 18, 2013 at 9:40 PM, Mike Sukmanowsky <[email protected]> wrote:
> >
> > > You should check out the quantile libraries in LinkedIn's DataFu UDFs:
> > > https://github.com/linkedin/datafu specifically
> > >
> > > https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/Quantile.java
> > > for relatively small inputs, and
> > >
> > > https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/StreamingQuantile.java
> > > for larger inputs.
> > >
> > > You can use this to receive the top x% for any given field and then you
> > > can use that within a filter
> > >
> > > On Mon, Mar 18, 2013 at 6:23 AM, Marco Cadetg <[email protected]> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I would like to do something very similar to a nested foreach with
> > > > using order by and then limit. But I would like to limit on a relation
> > > > to the total number of records.
> > > >
> > > > users = load 'users' as (userid:chararray, money:long, region:chararray);
> > > > grouped_region = group users by region;
> > > > top_10_percent = foreach grouped_region {
> > > >   sorted = order users by money desc;
> > > >   top = limit sorted $UKNOWN_HOWTO_SET; -- e.g. for the top 10% it would be total users/10 in that region.
> > > >   generate group, flatten(top);
> > > > };
> > > >
> > > > Thanks a lot for any help on this.
> > > >
> > > > Cheers,
> > > > -Marco
> > >
> > > --
> > > Mike Sukmanowsky
> > >
> > > Product Lead, http://parse.ly
> > > 989 Avenue of the Americas, 3rd Floor
> > > New York, NY 10018
> > > p: +1 (416) 953-4248
> > > e: [email protected]
