Whoops, hit enter. Just to see: how long does it take if you just store h?

2012/4/3 Jonathan Coveney <[email protected]>

> point 1: doing dump is dangerous, depending on how many rows you expect
> in the relation: you're going to serialize every row of the output to
> your console.
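> For example (the path here is just illustrative), storing instead of
> dumping writes the output to HDFS rather than streaming it to your
> console:
>
> store j into '/tmp/merch_agg';  -- replaces "dump j;" at the end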
> point 2: the issue is that you're doing a nested DISTINCT. This is done
> in memory, and for large data sets it can be quite slow. The scalable
> solution is to do two groups.
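> For instance, a rough sketch of the two-group rewrite for one of the
> distinct counts (relation and field names just mirror the script below):
>
> -- distinct title count per dateint, without a nested in-memory DISTINCT
> t1 = foreach h generate dateint, title_id;
> t2 = group t1 by (dateint, title_id);  -- first group: one row per pair
> t3 = foreach t2 generate flatten(group) as (dateint, title_id);
> t4 = group t3 by dateint;              -- second group: count the pairs
> t5 = foreach t4 generate group as dateint, COUNT(t3) as dist_titles;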
>
>
> 2012/4/3 sonia gehlot <[email protected]>
>
>> Thanks Guys,
>>
>> This is the Pig script I am running. The dataset is also small for the
>> filtered date, around 2 million rows, but I am aiming to write this
>> script for a larger scope. Here, titles is an array of JSON objects
>> stored as a string, so I am using a Python UDF to split it into
>> columns/rows.
>>
>> register 'dse_pig-1.0.0.jar' ;
>> register 'udfs.py' using jython as udfs;
>> SET default_parallel 100;
>> a = load 'default.merchimpressions_client_log' using DataovenLoader();
>> b = filter a by (dateint == 20120331);
>> d = foreach b generate
>>    dateint,
>>    account_id,
>>    device_type,
>>    esn,
>>    country,
>>    flatten(udfs.to_map(titles)) as titles_map;
>> e = foreach d generate
>>    dateint,
>>    account_id,
>>    device_type,
>>    esn,
>>    country,
>>    (chararray)titles_map#'location' as location,
>>    (long)titles_map#'time' as time,
>>    (int)titles_map#'count' as count,
>>    (int)titles_map#'video_id' as title_id,
>>    (int)titles_map#'track_id' as location_id,
>>    (chararray)titles_map#'request_id' as request_id,
>>    (int)titles_map#'row' as pres_row,
>>    (int)titles_map#'rank' as pres_rank;
>>  f = group e by (dateint,
>>         account_id,
>>         device_type,
>>         esn,
>>         country,
>>         title_id,
>>         location_id,
>>         request_id,
>>         pres_row,
>>         pres_rank);
>> g = foreach f generate group,
>>               COUNT(e) as row_cnt,
>>               MIN(e.time) as min_time,
>>               SUM(e.count) as imp_cnt;
>> h = foreach g generate group.dateint as dateint,
>>               group.account_id as account_id,
>>               group.country as country,
>>               group.device_type as device_type,
>>               group.esn as esn,
>>               group.title_id as title_id,
>>               group.location_id as location_id,
>>               group.request_id as request_id,
>>               group.pres_row as pres_row,
>>               group.pres_rank as pres_rank,
>>               row_cnt as row_cnt,
>>               min_time as min_time,
>>               imp_cnt as imp_cnt;
>> i = group h by dateint;
>> j = foreach i  {
>>            dist_titles = DISTINCT h.title_id;
>>            dist_acct = DISTINCT h.account_id;
>>            generate
>>            group as dateint,
>>                (int)COUNT(dist_titles) as dist_titles,
>>                (int)COUNT(dist_acct) as dist_acct,
>>                (int)SUM(h.imp_cnt) as imp_cnt;
>>        };
>> dump j;
>>
>> If I run this script for 1-2 account_ids it's really fast, coming back
>> in minutes, but for 2 million rows it takes forever. It kicks off 3 map
>> reduce jobs, and the map phase of one MR job alone takes 2.5 hours.
>>
>> Please let me know how I can improve performance.
>>
>> Thanks,
>> Sonia
>>
>> On Mon, Apr 2, 2012 at 5:51 PM, Jonathan Coveney <[email protected]>
>> wrote:
>>
>> > One of these things is not like the others :) Distincts are dangerous.
>> > Prashant is right: post the script and we can help you dig in. Sum,
>> > count, and max should all be super fast, and if they aren't, it's
>> > because the Algebraic optimization isn't being triggered.
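>> > For instance (a sketch only, relation names illustrative): Algebraic
>> > functions applied directly to the grouped bag can run in the combiner,
>> > while a nested block inside the foreach cannot:
>> >
>> > g = group e by dateint;
>> > -- combiner-friendly: COUNT/SUM/MIN are Algebraic
>> > s = foreach g generate group, COUNT(e), SUM(e.count), MIN(e.time);
>> > -- a nested DISTINCT (or any other nested block) in the foreach
>> > -- disables the combiner, so each key's whole bag is processed in a
>> > -- single reducer call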
>> >
>> > 2012/4/2 Prashant Kommireddi <[email protected]>
>> >
>> > > Can you please forward the script and the job counters? Cluster size
>> > > (# of Map Reduce slots) would be good too.
>> > >
>> > > Thanks,
>> > > Prashant
>> > >
>> > > On Mon, Apr 2, 2012 at 5:27 PM, sonia gehlot <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I have a really large data set of about 10 to 15 billion rows. I
>> > > > wanted to do some aggregates like sum, count distinct, max, etc.,
>> > > > but the script is taking forever to run.
>> > > >
>> > > > What hints or properties should I set to improve performance?
>> > > >
>> > > > Please let me know.
>> > > >
>> > > > Thanks,
>> > > > Sonia
>> > > >
>> > >
>> >
>>
>
>
