Thanks, guys.

This is the Pig script I am running. The dataset is small for the filtered
date, around 2 million rows, but I am aiming to use this script at a much
larger scale. Here, titles is an array of JSON objects stored as a string,
so I am using a Python UDF to split it into columns/rows.
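For context, the to_map UDF in udfs.py is along these lines (a simplified
sketch, assuming titles holds a JSON array of objects; the real field set and
schema string may differ):

```python
# udfs.py -- simplified sketch of the Jython UDF registered with
# `register 'udfs.py' using jython as udfs;`. Assumes `titles` is a
# string containing a JSON array of objects.
import json  # on older Jython this may need simplejson instead

try:
    # Pig injects outputSchema into the namespace when the script is
    # registered `using jython`.
    outputSchema
except NameError:
    # No-op fallback so the module also loads outside Pig (e.g. for tests).
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema('titles:bag{t:(m:map[])}')
def to_map(titles):
    """Parse the JSON array and emit one map per title, so that
    flatten() in the Pig script turns each map into its own row."""
    if not titles:
        return []
    # Each bag element must be a tuple, hence the one-element tuples.
    return [(d,) for d in json.loads(titles)]
```

flatten(udfs.to_map(titles)) then explodes each input row into one row per
title, with the map fields pulled out via titles_map#'...' downstream.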

register 'dse_pig-1.0.0.jar' ;
register 'udfs.py' using jython as udfs;
SET default_parallel 100;
a = load 'default.merchimpressions_client_log' using DataovenLoader();
b = filter a by (dateint == 20120331);
d = foreach b generate
    dateint,
    account_id,
    device_type,
    esn,
    country,
    flatten(udfs.to_map(titles)) as titles_map;
e = foreach d generate
    dateint,
    account_id,
    device_type,
    esn,
    country,
    (chararray)titles_map#'location' as location,
    (long)titles_map#'time' as time,
    (int)titles_map#'count' as count,
    (int)titles_map#'video_id' as title_id,
    (int)titles_map#'track_id' as location_id,
    (chararray)titles_map#'request_id' as request_id,
    (int)titles_map#'row' as pres_row,
    (int)titles_map#'rank' as pres_rank;
f = group e by (dateint,
         account_id,
         device_type,
         esn,
         country,
         title_id,
         location_id,
         request_id,
         pres_row,
         pres_rank);
g = foreach f generate group,
               COUNT(e) as row_cnt,
               MIN(e.time) as min_time,
               SUM(e.count) as imp_cnt;
h = foreach g generate group.dateint as dateint,
               group.account_id as account_id,
               group.country as country,
               group.device_type as device_type,
               group.esn as esn,
               group.title_id as title_id,
               group.location_id as location_id,
               group.request_id as request_id,
               group.pres_row as pres_row,
               group.pres_rank as pres_rank,
               row_cnt as row_cnt,
               min_time as min_time,
               imp_cnt as imp_cnt;
i = group h by dateint;
j = foreach i {
        dist_titles = DISTINCT h.title_id;
        dist_acct = DISTINCT h.account_id;
        generate group as dateint,
            (int)COUNT(dist_titles) as dist_titles,
            (int)COUNT(dist_acct) as dist_acct,
            (int)SUM(h.imp_cnt) as imp_cnt;
};
dump j;

If I run this script for one or two account_ids it is really fast and comes
back in minutes, but for 2 million rows it takes forever. It kicks off three
MapReduce jobs, and the map phase of one of them alone takes 2.5 hours.

Please let me know how I can improve performance.

Thanks,
Sonia

On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]> wrote:

> One of these things is not like the other :) Distincts are dangerous.
> Prashant is right, post the script, and we can help you dig in. SUM, COUNT,
> and MAX should all be super fast, and if they aren't, it's because their
> Algebraic nature isn't kicking in.
>
> 2012/4/2 Prashant Kommireddi <[email protected]>
>
> > Can you please forward the script and Job Counters? Cluster size (# of
> > MapReduce slots) would be good too.
> >
> > Thanks,
> > Prashant
> >
> > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I have a really large data set of about 10 to 15 billion rows. I wanted
> > > to do some aggregates like sum, count distinct, max, etc., but the
> > > script is taking forever to run.
> > >
> > > What hints or properties should I set to improve performance.
> > >
> > > Please let me know.
> > >
> > > Thanks,
> > > Sonia
> > >
> >
>