Yeah, that sounds like a lot to dump if it takes 15 minutes to run. That alone can take a long time. I once forgot to comment out a debug line in my UDF. When run with production data, not only was it slow, it blew up the cluster: it simply ran out of log space :)
On Jun 17, 2011, at 5:06 PM, Jonathan Coveney <[email protected]> wrote:

> A couple of possibilities that I'm kicking around off the top of my head...
>
> 1) Does your MR job also sort afterwards? That's going to kick off another
> MR job
> 2) Does your MR job compile all the results into one job?
>
> My guess is the Order+Dump are making it take longer.
>
> 2011/6/17 Sujee Maniyam <[email protected]>
>
>> I have log files like this:
>> #timestamp (ms), server, user, action, domain, x, y, z
>> 1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
>> 1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
>> 1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
>>
>> I have the following pig script to count the number of domains from logs.
>> (For example, we have seen facebook.com 10 times ..etc.)
>>
>> Here is the pig script:
>>
>> --------------------------------
>> records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
>>     server:int, user:int, action_id:int, domain:chararray, price:int);
>>
>> -- DUMP records;
>> grouped_by_domain = GROUP records BY domain;
>> -- DUMP grouped_by_domain;
>> -- DESCRIBE grouped_by_domain;
>>
>> freq = FOREACH grouped_by_domain GENERATE group AS domain,
>>     COUNT(records) AS mycount;
>> -- DESCRIBE freq;
>> -- DUMP freq;
>>
>> sorted = ORDER freq BY mycount DESC;
>> DUMP sorted;
>> --------------------------------
>>
>> This script takes an hour to run. I also wrote a simple Java MR job to
>> count the domains, and it takes about 15 mins. So the pig script is taking
>> 4x longer to complete.
>>
>> Any suggestions on what I am doing wrong in pig?
>>
>> thanks
>> Sujee
>> http://sujee.net
>>
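For reference, the GROUP / COUNT / ORDER pipeline in the quoted Pig script boils down to a domain frequency count sorted descending. Here's a small local Python sketch of that logic, handy for sanity-checking results on a sample of the log before kicking off the full cluster job (the sample rows below are made up to match the log format shown in the thread):

```python
from collections import Counter

# Hypothetical sample rows mirroring the thread's log format:
# timestamp(ms), server, user, action, domain, price, ...
rows = [
    "1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar",
    "1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar",
    "1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar",
    "1262332800031, 3, 50817, 2, yahoo.com, 12, blahblah, foobar",
]

# Equivalent of: GROUP records BY domain; ... COUNT(records)
# (the domain is the 5th comma-separated field, index 4)
counts = Counter(line.split(",")[4].strip() for line in rows)

# Equivalent of: ORDER freq BY mycount DESC
sorted_counts = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(sorted_counts)
# -> [('yahoo.com', 2), ('google.com', 1), ('facebook.com', 1)]
```

Note that in Pig the ORDER + DUMP at the end launches an additional MR job on top of the GROUP, which is likely part of the gap versus the single-pass Java job, as Jonathan suggests above.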
