Point 1: doing a dump is dangerous, depending on how many rows you expect in the relation; you're going to serialize every row of the output to your console.

Point 2: the issue is that you're doing a nested DISTINCT. This is executed in memory, and for large data sets it can be quite slow. The scalable solution is to do two groups instead.
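A sketch of what the two-group pattern could look like for one of the distinct counts, based on relation h from the script below (the intermediate relation names and output path here are my own, not from the original script):

```pig
-- Hypothetical rewrite of the final step: replace the nested DISTINCT
-- inside the foreach with two GROUP BYs, so deduplication happens across
-- MapReduce rather than in reducer memory.

-- First group: dedupe (dateint, title_id) pairs.
title_pairs = FOREACH h GENERATE dateint, title_id;
title_grp   = GROUP title_pairs BY (dateint, title_id);
title_dedup = FOREACH title_grp GENERATE FLATTEN(group) AS (dateint, title_id);

-- Second group: count the deduped pairs per dateint.
title_by_day = GROUP title_dedup BY dateint;
dist_titles  = FOREACH title_by_day GENERATE group AS dateint,
                   (int)COUNT(title_dedup) AS dist_titles;

-- Repeat the same pattern for account_id, then JOIN the two results
-- on dateint to get both counts in one relation.

-- STORE instead of DUMP, so millions of rows aren't written to the console.
STORE dist_titles INTO 'dist_titles_by_day';
```

COUNT is Algebraic, so both groups can use combiners; the expensive part, deduping title_ids, is spread over the cluster instead of being done per-reducer in memory.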
2012/4/3 sonia gehlot <[email protected]>

> Thanks Guys,
>
> This is the Pig script I am running. The dataset is also small for the
> filtered date, around 2 million rows, but I am targeting to write this
> script for a larger scope. In here titles is an array of JSON objects but
> stored as a string datatype, so I am using a Python UDF to split it into
> columns/rows.
>
> register 'dse_pig-1.0.0.jar';
> register 'udfs.py' using jython as udfs;
> SET default_parallel 100;
>
> a = load 'default.merchimpressions_client_log' using DataovenLoader();
> b = filter a by (dateint == 20120331);
> d = foreach b generate
>         dateint,
>         account_id,
>         device_type,
>         esn,
>         country,
>         flatten(udfs.to_map(titles)) as titles_map;
> e = foreach d generate
>         dateint,
>         account_id,
>         device_type,
>         esn,
>         country,
>         (chararray)titles_map#'location' as location,
>         (long)titles_map#'time' as time,
>         (int)titles_map#'count' as count,
>         (int)titles_map#'video_id' as title_id,
>         (int)titles_map#'track_id' as location_id,
>         (chararray)titles_map#'request_id' as request_id,
>         (int)titles_map#'row' as pres_row,
>         (int)titles_map#'rank' as pres_rank;
> f = group e by (dateint,
>         account_id,
>         device_type,
>         esn,
>         country,
>         title_id,
>         location_id,
>         request_id,
>         pres_row,
>         pres_rank);
> g = foreach f generate group,
>         COUNT(e) as row_cnt,
>         MIN(e.time) as min_time,
>         SUM(e.count) as imp_cnt;
> h = foreach g generate group.dateint as dateint,
>         group.account_id as account_id,
>         group.country as country,
>         group.device_type as device_type,
>         group.esn as esn,
>         group.title_id as title_id,
>         group.location_id as location_id,
>         group.request_id as request_id,
>         group.pres_row as pres_row,
>         group.pres_rank as pres_rank,
>         row_cnt as row_cnt,
>         min_time as min_time,
>         imp_cnt as imp_cnt;
> i = group h by dateint;
> j = foreach i {
>         dist_titles = DISTINCT h.title_id;
>         dist_acct = DISTINCT h.account_id;
>         generate
>             group as dateint,
>             (int)COUNT(dist_titles) as dist_titles,
>             (int)COUNT(dist_acct) as dist_acct,
>             (int)SUM(h.imp_cnt) as imp_cnt;
> };
> dump j;
>
> If I run this script for 1-2 account_ids then it's really fast, coming back
> in minutes, but for 2 million rows it takes forever. It kicks off 3
> MapReduce jobs, and the map phase of one MR job alone takes 2.5 hours.
>
> Please let me know how I can improve performance.
>
> Thanks,
> Sonia
>
> On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]> wrote:
>
> > One of these things is not like the other :) Distincts are dangerous.
> > Prashant is right, post the script, and we can help you dig in. Sum,
> > count, and max should all be super fast, and if they aren't it's because
> > the Algebraic nature isn't being kicked off.
> >
> > 2012/4/2 Prashant Kommireddi <[email protected]>
> >
> > > Can you please forward the script and job counters? Cluster size and
> > > number of MapReduce slots would be good too.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a really large dataset of about 10 to 15 billion rows. I
> > > > wanted to do some aggregates like sum, count distinct, max, etc.,
> > > > but the script is taking forever to run.
> > > >
> > > > What hints or properties should I set to improve performance?
> > > >
> > > > Please let me know.
> > > >
> > > > Thanks,
> > > > Sonia
