I have log files like this:
#timestamp (ms), server, user, action, domain, x, y, z
1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
I have the following Pig script to count how many times each domain appears in the logs (for example, facebook.com was seen 10 times, etc.):
--------------------------------
records = LOAD '/logs-in/*.log' USING PigStorage(',')
    AS (ts:long, server:int, user:int, action_id:int, domain:chararray, price:int);
-- DUMP records;
grouped_by_domain = GROUP records BY domain;
-- DUMP grouped_by_domain;
-- DESCRIBE grouped_by_domain;
freq = FOREACH grouped_by_domain GENERATE group AS domain, COUNT(records) AS mycount;
-- DESCRIBE freq;
-- DUMP freq;
sorted = ORDER freq BY mycount DESC;
DUMP sorted;
--------------------------------
This script takes an hour to run. I also wrote a simple Java MR job to count the domains, and it takes about 15 minutes, so the Pig script takes 4x longer to complete.
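For reference, the MR job is basically the standard Hadoop word-count pattern keyed on the domain column. Below is a simplified sketch of that kind of job, not the exact code I ran (class and variable names here are just illustrative):
--------------------------------
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a "count the domains" MR job: map each log line to (domain, 1),
// then sum the counts per domain in the reducer.
public class DomainCount {

    public static class DomainMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text domain = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 4) {               // domain is the 5th column
                domain.set(fields[4].trim());
                context.write(domain, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "domain count");
        job.setJarByClass(DomainCount.class);
        job.setMapperClass(DomainMapper.class);
        job.setCombinerClass(SumReducer.class);    // pre-aggregate before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
--------------------------------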
Any suggestions on what I am doing wrong in Pig?
thanks
Sujee
http://sujee.net