Thanks, guys.
This is the Pig script I am running. The dataset for the filtered date is
fairly small, around 2 million rows, but I am targeting this script at a
much larger scope. Here, titles is an array of JSON objects stored as a
string, so I am using a Python UDF to split it into columns/rows.
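A minimal sketch of the kind of parsing such a UDF might do (hypothetical, plain Python; my actual udfs.py may differ — a real Jython UDF would also carry Pig's @outputSchema decorator to declare the bag schema, and older Jython ships without the json module, so simplejson is often used instead):

```python
import json

# Hypothetical sketch of a to_map-style UDF: turn a JSON-encoded array of
# objects into a list of one-field tuples, each holding a dict/map, so that
# Pig's flatten() can explode the bag into one row per object.
def to_map(titles):
    if not titles:
        return []  # null/empty input -> empty bag, so flatten() drops the row
    return [(obj,) for obj in json.loads(titles)]
```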
register 'dse_pig-1.0.0.jar';
register 'udfs.py' using jython as udfs;
SET default_parallel 100;
a = load 'default.merchimpressions_client_log' using DataovenLoader();
b = filter a by (dateint == 20120331);
d = foreach b generate
    dateint,
    account_id,
    device_type,
    esn,
    country,
    flatten(udfs.to_map(titles)) as titles_map;
e = foreach d generate
    dateint,
    account_id,
    device_type,
    esn,
    country,
    (chararray)titles_map#'location' as location,
    (long)titles_map#'time' as time,
    (int)titles_map#'count' as count,
    (int)titles_map#'video_id' as title_id,
    (int)titles_map#'track_id' as location_id,
    (chararray)titles_map#'request_id' as request_id,
    (int)titles_map#'row' as pres_row,
    (int)titles_map#'rank' as pres_rank;
f = group e by (dateint,
    account_id,
    device_type,
    esn,
    country,
    title_id,
    location_id,
    request_id,
    pres_row,
    pres_rank);
g = foreach f generate
    group,
    COUNT(e) as row_cnt,
    MIN(e.time) as min_time,
    SUM(e.count) as imp_cnt;
h = foreach g generate
    group.dateint as dateint,
    group.account_id as account_id,
    group.country as country,
    group.device_type as device_type,
    group.esn as esn,
    group.title_id as title_id,
    group.location_id as location_id,
    group.request_id as request_id,
    group.pres_row as pres_row,
    group.pres_rank as pres_rank,
    row_cnt as row_cnt,
    min_time as min_time,
    imp_cnt as imp_cnt;
i = group h by dateint;
j = foreach i {
    dist_titles = DISTINCT h.title_id;
    dist_acct = DISTINCT h.account_id;
    generate
        group as dateint,
        (int)COUNT(dist_titles) as dist_titles,
        (int)COUNT(dist_acct) as dist_acct,
        (int)SUM(h.imp_cnt) as imp_cnt;
};
dump j;
If I run this script for 1-2 account_ids it is really fast and comes back
in minutes, but for 2 million rows it takes forever. It kicks off 3
MapReduce jobs, and the map phase of one of them alone takes 2.5 hours.
Please let me know how I can improve performance.
Thanks,
Sonia
On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]> wrote:
> One of these things is not like the other :) Distincts are dangerous.
> Prashant is right, post the script and we can help you dig in. SUM, COUNT,
> and MAX should all be super fast, and if they aren't it's because their
> Algebraic nature isn't being kicked off (the combiner isn't running).
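[To illustrate the point above about Algebraic functions, a plain-Python sketch (illustrative only, not Pig): SUM and COUNT can merge per-mapper partial results in a combiner, while COUNT DISTINCT cannot — every distinct value must reach the reducer first.]

```python
# Why SUM is "algebraic" but COUNT DISTINCT is not.
partitions = [[1, 2, 2], [3, 1], [2, 3, 3]]  # values seen by three mappers

# SUM: each mapper emits one partial sum; the reducer merges just 3 numbers.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)  # equals the sum over all values

# COUNT DISTINCT: partial counts are useless (1 appears on two mappers),
# so every distinct value must be shipped to the reducer before counting.
all_distinct = set().union(*(set(p) for p in partitions))
n_distinct = len(all_distinct)
```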
>
> 2012/4/2 Prashant Kommireddi <[email protected]>
>
> > Can you please forward the script and the Job Counters? Cluster size
> > (# of Map Reduce slots) would be good too.
> >
> > Thanks,
> > Prashant
> >
> > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I have a really large dataset of about 10 to 15 billion rows. I wanted
> > > to do some aggregates like SUM, COUNT DISTINCT, MAX, etc., but the
> > > script is taking forever to run.
> > >
> > > What hints or properties should I set to improve performance?
> > >
> > > Please let me know.
> > >
> > > Thanks,
> > > Sonia
> > >
> >
>