Yeah, that sounds like a lot of data to dump if the job takes 15 minutes
to run. That alone can take a long time.
 
I once forgot to comment out a debug line in my UDF. When run against
production data, not only was it slow, it blew up the cluster - it simply
ran out of log space :)
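
If you don't actually need the results on the console, STORE is usually
much cheaper than DUMP, and you can LIMIT after the ORDER if you only care
about the top domains. Something like this (untested sketch; the output
path and the cutoff of 100 are just placeholders):

sorted = ORDER freq BY mycount DESC;
-- hypothetical top-N cutoff; Pig can push the limit into the sort
top_domains = LIMIT sorted 100;
STORE top_domains INTO '/logs-out/domain-counts' USING PigStorage(',');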

On Jun 17, 2011, at 5:06 PM, Jonathan Coveney <[email protected]> wrote:

> A couple of possibilities that I'm kicking around off the top of my head...
> 
> 1) Does your MR job also sort afterwards? That's going to kick off another
> MR job
> 2) Does your MR job combine all the results into one file?
> 
> My guess is that the ORDER + DUMP are making it take longer.
> 
> 2011/6/17 Sujee Maniyam <[email protected]>
> 
>> I have log files like this:
>>  #timestamp(ms), server, user, action, domain, x, y, z
>>  1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
>>  1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
>>  1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
>> 
>> I have the following Pig script to count how many times each domain
>> appears in the logs (for example, facebook.com seen 10 times, etc.):
>> 
>> --------------------------------
>> records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
>> server:int, user:int, action_id:int, domain:chararray, price:int);
>> 
>> -- DUMP records;
>> grouped_by_domain = GROUP records BY domain;
>> -- DUMP grouped_by_domain;
>> -- DESCRIBE grouped_by_domain;
>> 
>> freq = FOREACH grouped_by_domain GENERATE group AS domain,
>>     COUNT(records) AS mycount;
>> -- DESCRIBE freq;
>> -- DUMP freq;
>> 
>> sorted = ORDER freq BY mycount DESC;
>> DUMP sorted;
>> --------------------------------
>> 
>> This script takes an hour to run. I also wrote a simple Java MR job to
>> count the domains, and it takes about 15 mins. So the Pig script takes
>> 4x longer to complete.
>> 
>> Any suggestions on what I am doing wrong in Pig?
>> 
>> thanks
>> Sujee
>> http://sujee.net
>> 
