Actually, I don't expect a lot of rows; there should be only one row in the output. I will try with GROUPs rather than DISTINCT.
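Something like this is what I have in mind for the distinct title count (a minimal sketch using relation h from my script below; the t1..t4 aliases are just for illustration). The idea is two GROUPs instead of a nested DISTINCT, so duplicates are collapsed by the shuffle rather than in reducer memory:

-- distinct title_id count per dateint via two GROUPs (no nested DISTINCT)
t1 = FOREACH h GENERATE dateint, title_id;
t2 = GROUP t1 BY (dateint, title_id);               -- first GROUP collapses duplicate (dateint, title_id) pairs
t3 = FOREACH t2 GENERATE group.dateint AS dateint;  -- one row per distinct (dateint, title_id)
t4 = GROUP t3 BY dateint;
dist_title_cnt = FOREACH t4 GENERATE group AS dateint, COUNT(t3) AS dist_titles;

The distinct account_id count would follow the same pattern, and the per-dateint results can then be joined back together.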
On Tue, Apr 3, 2012 at 12:09 PM, Jonathan Coveney <[email protected]> wrote:

> Whoops, hit enter. Just to see, how long does it take if you just store h?
>
> 2012/4/3 Jonathan Coveney <[email protected]>
>
> > Point 1: doing dump is dangerous, depending on how many rows you expect
> > in the relation. You're going to serialize every row of the output to
> > your console.
> > Point 2: the issue is that you're doing a nested DISTINCT. This is done
> > in memory, and for large data sets it can be quite slow. The scalable
> > solution is to do two groups.
> >
> >
> > 2012/4/3 sonia gehlot <[email protected]>
> >
> >> Thanks guys,
> >>
> >> This is the Pig script I am running. The dataset is small for the
> >> filtered date, around 2 million rows, but I am targeting this script at
> >> a larger scope. Here, titles is an array of JSON objects stored as a
> >> string, so I am using a Python UDF to split it into columns/rows.
> >>
> >> register 'dse_pig-1.0.0.jar' ;
> >> register 'udfs.py' using jython as udfs;
> >> SET default_parallel 100;
> >> a = load 'default.merchimpressions_client_log' using DataovenLoader();
> >> b = filter a by (dateint == 20120331);
> >> d = foreach b generate
> >>     dateint,
> >>     account_id,
> >>     device_type,
> >>     esn,
> >>     country,
> >>     flatten(udfs.to_map(titles)) as titles_map;
> >> e = foreach d generate
> >>     dateint,
> >>     account_id,
> >>     device_type,
> >>     esn,
> >>     country,
> >>     (chararray)titles_map#'location' as location,
> >>     (long)titles_map#'time' as time,
> >>     (int)titles_map#'count' as count,
> >>     (int)titles_map#'video_id' as title_id,
> >>     (int)titles_map#'track_id' as location_id,
> >>     (chararray)titles_map#'request_id' as request_id,
> >>     (int)titles_map#'row' as pres_row,
> >>     (int)titles_map#'rank' as pres_rank;
> >> f = group e by (dateint,
> >>     account_id,
> >>     device_type,
> >>     esn,
> >>     country,
> >>     title_id,
> >>     location_id,
> >>     request_id,
> >>     pres_row,
> >>     pres_rank);
> >> g = foreach f generate group,
> >>     COUNT(e) as row_cnt,
> >>     MIN(e.time) as min_time,
> >>     SUM(e.count) as imp_cnt;
> >> h = foreach g generate group.dateint as dateint,
> >>     group.account_id as account_id,
> >>     group.country as country,
> >>     group.device_type as device_type,
> >>     group.esn as esn,
> >>     group.title_id as title_id,
> >>     group.location_id as location_id,
> >>     group.request_id as request_id,
> >>     group.pres_row as pres_row,
> >>     group.pres_rank as pres_rank,
> >>     row_cnt as row_cnt,
> >>     min_time as min_time,
> >>     imp_cnt as imp_cnt;
> >> i = group h by dateint;
> >> j = foreach i {
> >>     dist_titles = DISTINCT h.title_id;
> >>     dist_acct = DISTINCT h.account_id;
> >>     generate
> >>         group as dateint,
> >>         (int)COUNT(dist_titles) as dist_titles,
> >>         (int)COUNT(dist_acct) as dist_acct,
> >>         (int)SUM(h.imp_cnt) as imp_cnt;
> >> };
> >> dump j;
> >>
> >> If I run this script for 1-2 account_ids it is really fast and comes back
> >> in minutes, but for 2 million rows it takes forever. It kicks off 3
> >> MapReduce jobs, and the map phase of one MR job alone takes 2.5 hours.
> >>
> >> Please let me know how I can improve performance.
> >>
> >> Thanks,
> >> Sonia
> >>
> >> On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]>
> >> wrote:
> >>
> >> > One of these things is not like the other :) DISTINCTs are dangerous.
> >> > Prashant is right, post the script and we can help you dig in. SUM,
> >> > COUNT, and MAX should all be super fast, and if they aren't it's
> >> > because the Algebraic nature isn't being kicked off.
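To try the "just store h" timing test suggested above, a minimal sketch (the output path here is only a placeholder) that replaces the dump so the console isn't the bottleneck and the first jobs can be timed separately from the distinct step:

-- instead of: dump j;
STORE h INTO '/tmp/merch_imp_h_debug';   -- placeholder path; uses the default PigStorage output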
> >> >
> >> > 2012/4/2 Prashant Kommireddi <[email protected]>
> >> >
> >> > > Can you please forward the script and the job counters? Cluster size
> >> > > (# of map and reduce slots) would be good too.
> >> > >
> >> > > Thanks,
> >> > > Prashant
> >> > >
> >> > > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > I have a really large data set of about 10 to 15 billion rows. I
> >> > > > wanted to do some aggregates like sum, count distinct, and max,
> >> > > > but the script is taking forever to run.
> >> > > >
> >> > > > What hints or properties should I set to improve performance?
> >> > > >
> >> > > > Please let me know.
> >> > > >
> >> > > > Thanks,
> >> > > > Sonia
