point 1: doing a DUMP is dangerous, depending on how many rows you expect
in the relation: it serializes every row of the output to your console.
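
A minimal sketch of two safer alternatives, reusing the final alias j from
the script quoted below (the output path is hypothetical):

```
-- Store the full result to HDFS instead of the console:
STORE j INTO '/tmp/merch_impressions_daily';

-- Or dump only a small sample for eyeballing:
j_sample = LIMIT j 20;
DUMP j_sample;
```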
point 2: the issue is that you're doing a nested DISTINCT. This runs in
memory, and for large data sets it can be quite slow. The scalable solution
is to use two GROUPs.
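
The two-GROUP approach can be sketched as follows, reusing the h alias from
the script quoted below (this computes one distinct count; each additional
distinct count needs its own pair of GROUPs):

```
-- First GROUP: deduplicate (dateint, title_id) pairs.
by_title   = GROUP h BY (dateint, title_id);
uniq_title = FOREACH by_title GENERATE FLATTEN(group) AS (dateint, title_id);

-- Second GROUP: count the surviving pairs per day.
per_day     = GROUP uniq_title BY dateint;
dist_titles = FOREACH per_day GENERATE group AS dateint,
                                       COUNT(uniq_title) AS dist_titles;
```

Because COUNT is Algebraic, both jobs can use the combiner, so nothing has
to be held in memory on a single reducer.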

2012/4/3 sonia gehlot <[email protected]>

> Thanks Guys,
>
> This is the Pig script I am running. The dataset is also small for the
> filtered date, around 2 million rows, but I am targeting this script at a
> larger scope. Here, titles is an array of JSON objects stored as a string,
> so I am using a Python UDF to split it into columns/rows.
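>
[Editor's note: the udfs.py referenced here is not included in the thread; a
hypothetical plain-Python sketch of what to_map might do, assuming titles is
a JSON-array string:]

```python
import json

# Hypothetical sketch of udfs.to_map (the real udfs.py is not shown in
# this thread). It parses the JSON-array string and returns a bag, i.e.
# a list of one-field tuples, each holding one title's map. A real Pig
# Jython UDF would also carry an @outputSchema decorator such as
# @outputSchema('titles:bag{t:(m:map[])}').
def to_map(titles):
    if titles is None:
        return None
    return [(title,) for title in json.loads(titles)]
```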
>
> register 'dse_pig-1.0.0.jar' ;
> register 'udfs.py' using jython as udfs;
> SET default_parallel 100;
> a = load 'default.merchimpressions_client_log' using DataovenLoader();
> b = filter a by (dateint == 20120331);
> d = foreach b generate
>    dateint,
>    account_id,
>    device_type,
>    esn,
>    country,
>    flatten(udfs.to_map(titles)) as titles_map;
> e = foreach d generate
>    dateint,
>    account_id,
>    device_type,
>    esn,
>    country,
>    (chararray)titles_map#'location' as location,
>    (long)titles_map#'time' as time,
>    (int)titles_map#'count' as count,
>    (int)titles_map#'video_id' as title_id,
>    (int)titles_map#'track_id' as location_id,
>    (chararray)titles_map#'request_id' as request_id,
>    (int)titles_map#'row' as pres_row,
>    (int)titles_map#'rank' as pres_rank;
>  f = group e by (dateint,
>         account_id,
>         device_type,
>         esn,
>         country,
>         title_id,
>         location_id,
>         request_id,
>         pres_row,
>         pres_rank);
> g = foreach f generate group,
>               COUNT(e) as row_cnt,
>               MIN(e.time) as min_time,
>               SUM(e.count) as imp_cnt;
> h = foreach g generate group.dateint as dateint,
>               group.account_id as account_id,
>               group.country as country,
>               group.device_type as device_type,
>               group.esn as esn,
>               group.title_id as title_id,
>               group.location_id as location_id,
>               group.request_id as request_id,
>               group.pres_row as pres_row,
>               group.pres_rank as pres_rank,
>               row_cnt as row_cnt,
>               min_time as min_time,
>               imp_cnt as imp_cnt;
> i = group h by dateint;
> j = foreach i  {
>            dist_titles = DISTINCT h.title_id;
>            dist_acct = DISTINCT h.account_id;
>            generate
>            group as dateint,
>                (int)COUNT(dist_titles) as dist_titles,
>                (int)COUNT(dist_acct) as dist_acct,
>                (int)SUM(h.imp_cnt) as imp_cnt;
>        };
> dump j;
>
> If I run this script for 1-2 account_ids it is really fast, coming back in
> minutes, but for 2 million rows it takes forever. It kicks off 3 MapReduce
> jobs, and the map phase of one job alone takes 2.5 hours.
>
> Please let me know how I can improve performance.
>
> Thanks,
> Sonia
>
> On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]>
> wrote:
>
> > One of these things is not like the other :) DISTINCTs are dangerous.
> > Prashant is right: post the script, and we can help you dig in. SUM,
> > COUNT, and MAX should all be super fast, and if they aren't, it's because
> > their Algebraic nature (the combiner optimization) isn't kicking in.
> >
> > 2012/4/2 Prashant Kommireddi <[email protected]>
> >
> > > Can you please forward the script and the job counters? Cluster size
> > > (# of MapReduce slots) would be good too.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a really large dataset of about 10 to 15 billion rows. I
> > > > wanted to do some aggregates like SUM, COUNT DISTINCT, MAX, etc., but
> > > > the script is taking forever to run.
> > > >
> > > > What hints or properties should I set to improve performance?
> > > >
> > > > Please let me know.
> > > >
> > > > Thanks,
> > > > Sonia
> > > >
> > >
> >
>
