Actually, I don't expect lots of rows; there should be only one row in the
output. I will try groups rather than DISTINCT.
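
In case it helps anyone following along, here is a rough sketch of the
two-GROUP rewrite (a sketch only; it assumes relation h keeps the schema
from the script quoted below, and the new alias names are made up):

-- Top-level DISTINCT + GROUP runs as its own MapReduce pass instead of
-- holding each day's whole bag in reducer memory like a nested DISTINCT.
t_pairs = foreach h generate dateint, title_id;
t_dist  = distinct t_pairs;
t_grp   = group t_dist by dateint;
t_cnt   = foreach t_grp generate group as dateint,
                                 (int)COUNT(t_dist) as dist_titles;
a_pairs = foreach h generate dateint, account_id;
a_dist  = distinct a_pairs;
a_grp   = group a_dist by dateint;
a_cnt   = foreach a_grp generate group as dateint,
                                 (int)COUNT(a_dist) as dist_acct;
i_grp   = group h by dateint;
imps    = foreach i_grp generate group as dateint,
                                 (int)SUM(h.imp_cnt) as imp_cnt;
-- Stitch the per-day aggregates back together on dateint.
jn      = join t_cnt by dateint, a_cnt by dateint, imps by dateint;
j2      = foreach jn generate t_cnt::dateint as dateint,
                              dist_titles, dist_acct, imp_cnt;
store j2 into 'output';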

On Tue, Apr 3, 2012 at 12:09 PM, Jonathan Coveney <[email protected]> wrote:

> Whoops, hit enter too soon. Just to see: how long does it take if you
> just store h?
>
> 2012/4/3 Jonathan Coveney <[email protected]>
>
> > Point 1: doing dump is dangerous, depending on how many rows you expect
> > in the relation. You're going to serialize every row in the output to
> > your console.
> > Point 2: the issue is that you're doing a nested DISTINCT. This is done
> > in memory, and for large data sets it can be quite slow. The scalable
> > solution is to do two GROUPs.
> >
> >
> > 2012/4/3 sonia gehlot <[email protected]>
> >
> >> Thanks Guys,
> >>
> >> This is the Pig script I am running. The data set is also small for
> >> the filtered date (around 2 million rows), but I am targeting a larger
> >> scope with this script. Here, titles is an array of JSON objects
> >> stored as a string, so I am using a Python UDF to split it into
> >> columns/rows.
> >>
> >> register 'dse_pig-1.0.0.jar' ;
> >> register 'udfs.py' using jython as udfs;
> >> SET default_parallel 100;
> >> a = load 'default.merchimpressions_client_log' using DataovenLoader();
> >> b = filter a by (dateint == 20120331);
> >> d = foreach b generate
> >>    dateint,
> >>    account_id,
> >>    device_type,
> >>    esn,
> >>    country,
> >>    flatten(udfs.to_map(titles)) as titles_map;
> >> e = foreach d generate
> >>    dateint,
> >>    account_id,
> >>    device_type,
> >>    esn,
> >>    country,
> >>    (chararray)titles_map#'location' as location,
> >>    (long)titles_map#'time' as time,
> >>    (int)titles_map#'count' as count,
> >>    (int)titles_map#'video_id' as title_id,
> >>    (int)titles_map#'track_id' as location_id,
> >>    (chararray)titles_map#'request_id' as request_id,
> >>    (int)titles_map#'row' as pres_row,
> >>    (int)titles_map#'rank' as pres_rank;
> >> f = group e by (dateint,
> >>         account_id,
> >>         device_type,
> >>         esn,
> >>         country,
> >>         title_id,
> >>         location_id,
> >>         request_id,
> >>         pres_row,
> >>         pres_rank);
> >> g = foreach f generate group,
> >>               COUNT(e) as row_cnt,
> >>               MIN(e.time) as min_time,
> >>               SUM(e.count) as imp_cnt;
> >> h = foreach g generate group.dateint as dateint,
> >>               group.account_id as account_id,
> >>               group.country as country,
> >>               group.device_type as device_type,
> >>               group.esn as esn,
> >>               group.title_id as title_id,
> >>               group.location_id as location_id,
> >>               group.request_id as request_id,
> >>               group.pres_row as pres_row,
> >>               group.pres_rank as pres_rank,
> >>               row_cnt as row_cnt,
> >>               min_time as min_time,
> >>               imp_cnt as imp_cnt;
> >> i = group h by dateint;
> >> j = foreach i  {
> >>            dist_titles = DISTINCT h.title_id;
> >>            dist_acct = DISTINCT h.account_id;
> >>            generate
> >>            group as dateint,
> >>                (int)COUNT(dist_titles) as dist_titles,
> >>                (int)COUNT(dist_acct) as dist_acct,
> >>                (int)SUM(h.imp_cnt) as imp_cnt;
> >>        };
> >> dump j;
> >>
> >> If I run this script for 1-2 account_ids it is really fast and comes
> >> back in minutes, but for 2 million rows it takes forever. It kicks off
> >> 3 MapReduce jobs, and the map phase of one MR job alone takes 2.5
> >> hours.
> >>
> >> Please let me know how I can improve performance.
> >>
> >> Thanks,
> >> Sonia
> >>
> >> On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]>
> >> wrote:
> >>
> >> > One of these things is not like the other :) Distincts are dangerous.
> >> > Prashant is right, post the script, and we can help you dig in. Sum,
> >> > count, and max should all be super fast, and if they aren't, it's
> >> > because their Algebraic nature isn't being kicked off.
> >> >
> >> > 2012/4/2 Prashant Kommireddi <[email protected]>
> >> >
> >> > > Can you please forward the script and Job Counters? Cluster size
> >> > > (# of Map Reduce slots) would be good too.
> >> > >
> >> > > Thanks,
> >> > > Prashant
> >> > >
> >> > > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > I have a really large data set of about 10 to 15 billion rows. I
> >> > > > wanted to do some aggregates like sum, count distinct, and max,
> >> > > > but it is taking forever to run the script.
> >> > > >
> >> > > > What hints or properties should I set to improve performance?
> >> > > >
> >> > > > Please let me know.
> >> > > >
> >> > > > Thanks,
> >> > > > Sonia
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
