Whoops, hit enter. Just to see: how long does it take if you just store h?

2012/4/3 Jonathan Coveney <[email protected]>
> Point 1: doing dump is dangerous, depending on how many rows you expect
> in the relation. You're going to serialize every row in the output to
> your console.
> Point 2: the issue is that you're doing a nested DISTINCT. This is done
> in memory, and for large data sets it can be quite slow. The scalable
> solution is to do two groups.
>
> 2012/4/3 sonia gehlot <[email protected]>
>
>> Thanks guys,
>>
>> This is the Pig script I am running. The dataset is also small for the
>> filtered date (around 2 million rows), but I am targeting this script
>> for a larger scope. Here, titles is an array of JSON objects stored as
>> a string, so I am using a Python UDF to split it into columns/rows.
>>
>> register 'dse_pig-1.0.0.jar';
>> register 'udfs.py' using jython as udfs;
>> SET default_parallel 100;
>> a = load 'default.merchimpressions_client_log' using DataovenLoader();
>> b = filter a by (dateint == 20120331);
>> d = foreach b generate
>>     dateint,
>>     account_id,
>>     device_type,
>>     esn,
>>     country,
>>     flatten(udfs.to_map(titles)) as titles_map;
>> e = foreach d generate
>>     dateint,
>>     account_id,
>>     device_type,
>>     esn,
>>     country,
>>     (chararray)titles_map#'location' as location,
>>     (long)titles_map#'time' as time,
>>     (int)titles_map#'count' as count,
>>     (int)titles_map#'video_id' as title_id,
>>     (int)titles_map#'track_id' as location_id,
>>     (chararray)titles_map#'request_id' as request_id,
>>     (int)titles_map#'row' as pres_row,
>>     (int)titles_map#'rank' as pres_rank;
>> f = group e by (dateint,
>>     account_id,
>>     device_type,
>>     esn,
>>     country,
>>     title_id,
>>     location_id,
>>     request_id,
>>     pres_row,
>>     pres_rank);
>> g = foreach f generate group,
>>     COUNT(e) as row_cnt,
>>     MIN(e.time) as min_time,
>>     SUM(e.count) as imp_cnt;
>> h = foreach g generate group.dateint as dateint,
>>     group.account_id as account_id,
>>     group.country as country,
>>     group.device_type as device_type,
>>     group.esn as esn,
>>     group.title_id as title_id,
>>     group.location_id as location_id,
>>     group.request_id as request_id,
>>     group.pres_row as pres_row,
>>     group.pres_rank as pres_rank,
>>     row_cnt as row_cnt,
>>     min_time as min_time,
>>     imp_cnt as imp_cnt;
>> i = group h by dateint;
>> j = foreach i {
>>     dist_titles = DISTINCT h.title_id;
>>     dist_acct = DISTINCT h.account_id;
>>     generate
>>         group as dateint,
>>         (int)COUNT(dist_titles) as dist_titles,
>>         (int)COUNT(dist_acct) as dist_acct,
>>         (int)SUM(h.imp_cnt) as imp_cnt;
>> };
>> dump j;
>>
>> If I run this script for 1-2 account_ids it's really fast (it comes
>> back in minutes), but for 2 million rows it takes forever. It kicks
>> off 3 map-reduce jobs, and the map part of one MR job alone takes
>> 2.5 hours.
>>
>> Please let me know how I can improve performance.
>>
>> Thanks,
>> Sonia
>>
>> On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]>
>> wrote:
>>
>>> One of these things is not like the other :) DISTINCTs are dangerous.
>>> Prashant is right: post the script and we can help you dig in. SUM,
>>> COUNT, and MAX should all be super fast, and if they aren't, it's
>>> because their Algebraic nature isn't being kicked off.
>>>
>>> 2012/4/2 Prashant Kommireddi <[email protected]>
>>>
>>>> Can you please forward the script and job counters? Cluster size
>>>> (number of map-reduce slots) would be good too.
>>>>
>>>> Thanks,
>>>> Prashant
>>>>
>>>> On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a really large dataset of about 10 to 15 billion rows. I
>>>>> wanted to do some aggregates like SUM, COUNT DISTINCT, MAX, etc.,
>>>>> but the script is taking forever to run.
>>>>>
>>>>> What hints or properties should I set to improve performance?
>>>>>
>>>>> Please let me know.
>>>>>
>>>>> Thanks,
>>>>> Sonia
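
For reference, the "two groups" rewrite Jonathan suggests would replace the nested-DISTINCT foreach (relation j above) with separate GROUP/DISTINCT passes, so each distinct count runs as its own combiner-friendly map-reduce job instead of materializing bags in reducer memory. This is only a sketch reusing the relation and field names from Sonia's script, not a tested replacement:

```
-- Distinct titles per day: project, DISTINCT, then group and count.
ht = foreach h generate dateint, title_id;
ht = distinct ht;
titles_per_day = foreach (group ht by dateint) generate
    group as dateint, (int)COUNT(ht) as dist_titles;

-- Distinct accounts per day, same pattern.
ha = foreach h generate dateint, account_id;
ha = distinct ha;
accts_per_day = foreach (group ha by dateint) generate
    group as dateint, (int)COUNT(ha) as dist_acct;

-- Impression sum per day (SUM is Algebraic, so the combiner kicks in).
imps_per_day = foreach (group h by dateint) generate
    group as dateint, (int)SUM(h.imp_cnt) as imp_cnt;

-- Stitch the three daily aggregates back together on dateint.
joined = join titles_per_day by dateint,
              accts_per_day  by dateint,
              imps_per_day   by dateint;
j2 = foreach joined generate
    titles_per_day::dateint as dateint,
    dist_titles, dist_acct, imp_cnt;
store j2 into 'daily_summary';
```

Since there is only one group key (dateint), the output per day is a single row, so storing j2 (rather than dumping it) also addresses point 1.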
