Actually, what I was looking for isn't distributed quantiles. I was
looking for the share of the total that the top x% hold. E.g. in my example
it could be that the top 10% of the users hold 50% of the total money.

So it looks like I'll need to come up with a UDF which delivers this.
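
To make the target metric concrete, here is a minimal Python sketch (not
Pig, and not the UDF itself) of "share of the total held by the top x% per
region". The function name top_share, the integer percent parameter, and the
rule of rounding the cutoff down while keeping at least one user are
assumptions made for illustration only.

```python
from collections import defaultdict


def top_share(users, percent=10):
    """Per region, share of total money held by the top `percent`% of users.

    `users` is an iterable of (userid, money, region) tuples, mirroring the
    schema of the 'users' relation. Hypothetical helper, for illustration.
    """
    by_region = defaultdict(list)
    for _userid, money, region in users:
        by_region[region].append(money)

    shares = {}
    for region, amounts in by_region.items():
        amounts.sort(reverse=True)
        # Assumption: round the cutoff down, but always keep at least one user.
        k = max(1, len(amounts) * percent // 100)
        total = sum(amounts)
        shares[region] = sum(amounts[:k]) / total if total else 0.0
    return shares


# Ten users in one region: nine hold 1 each, one holds 9.
users = [("u%d" % i, 1, "eu") for i in range(9)] + [("rich", 9, "eu")]
print(top_share(users, percent=10))  # {'eu': 0.5}
```

In this toy input the top 10% is a single user holding 9 of 18 units, which
matches the "top 10% hold 50%" case above.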

Cheers,
-Marco
On 19 Mar 2013 00:23, "Mike Sukmanowsky" <[email protected]> wrote:

> Distributed quantiles aren't an easy problem to solve (as you can see from
> LinkedIn's source) but perhaps in time they'll be brought into core
> functions.  It wasn't until 0.11.0 that date/time functions became
> built-ins; before that, a combination of Piggybank and custom UDFs was needed.
>
>
> On Mon, Mar 18, 2013 at 5:13 PM, Marco Cadetg <[email protected]> wrote:
>
> > Thanks a lot Mike. This seems to be what I'm looking for ;)
> >
> > I'm a bit disappointed that what I wanted to achieve isn't possible
> > without using any UDF.
> >
> > Cheers,
> > -Marco
> >
> >
> > On Mon, Mar 18, 2013 at 9:40 PM, Mike Sukmanowsky <[email protected]>
> > wrote:
> >
> > > You should check out the quantile libraries in LinkedIn's DataFu UDFs:
> > > https://github.com/linkedin/datafu specifically
> > >
> > > https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/Quantile.java
> > > for relatively small inputs, and
> > >
> > > https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/stats/StreamingQuantile.java
> > > for larger inputs.
> > >
> > > You can use this to get the top x% for any given field and then you
> > > can use that within a filter.
> > >
> > >
> > > On Mon, Mar 18, 2013 at 6:23 AM, Marco Cadetg <[email protected]> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I would like to do something very similar to a nested foreach using
> > > > order by and then limit, but I would like the limit to be relative to
> > > > the total number of records.
> > > >
> > > > users = load 'users' as (userid:chararray, money:long, region:chararray);
> > > > grouped_region = group users by region;
> > > > top_10_percent = foreach grouped_region {
> > > >             sorted = order users by money desc;
> > > >             -- e.g. for the top 10% it would be total users/10 in that region.
> > > >             top    = limit sorted $UKNOWN_HOWTO_SET;
> > > >             generate group, flatten(top);
> > > > };
> > > >
> > > > Thanks a lot for any help on this.
> > > >
> > > > Cheers,
> > > > -Marco
> > > >
> > >
> > >
> > >
> > > --
> > > Mike Sukmanowsky
> > >
> > > Product Lead, http://parse.ly
> > > 989 Avenue of the Americas, 3rd Floor
> > > New York, NY  10018
> > > p: +1 (416) 953-4248
> > > e: [email protected]
> > >
> >
>
