Hi,

I'm trying to write a Pig script that lists the top N IP addresses per hour,
ranked by number of hits.


Currently I have something like this:


PER_IP = GROUP CFP_LOGS_CLICKS_WITHOUT_0 BY (dayNumber, hourNumber, ip);
IP_COUNT = FOREACH PER_IP {
    numEntries = COUNT(CFP_LOGS_CLICKS_WITHOUT_0.timestamp);
    GENERATE group.dayNumber, group.hourNumber, group.ip, numEntries;
};

IP_COUNT_GROUPED = GROUP IP_COUNT BY ($0, $1);
IP_COUNT_PER_HOUR = FOREACH IP_COUNT_GROUPED GENERATE
    group.dayNumber, group.hourNumber, MAX(IP_COUNT.$3), AVG(IP_COUNT.$3);

DUMP IP_COUNT_PER_HOUR;



This gives me, for each hour, the highest number of hits from a single IP and
the average number of hits per IP. What I would like to get instead is:

- The N entries with the highest visit count, preferably with both the count AND the value

I've been looking at LIMIT and ORDER BY, but I don't really get how to wire
them in so that they operate on each group instead of on all the data.
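
To make the goal concrete, here is a rough sketch of what I imagine the
script would look like. It assumes that a nested FOREACH block accepts ORDER
and LIMIT in my Pig version, which I have not been able to confirm. The
value N = 10, the AS aliases and the relation names PER_HOUR and
TOP_IPS_PER_HOUR are placeholders I made up for illustration:

-- Same per-IP count as above, but with explicit field names (assumed aliases)
IP_COUNT = FOREACH PER_IP GENERATE
    group.dayNumber  AS dayNumber,
    group.hourNumber AS hourNumber,
    group.ip         AS ip,
    COUNT(CFP_LOGS_CLICKS_WITHOUT_0.timestamp) AS numEntries;

-- One group per (day, hour), each holding a bag of (ip, numEntries) rows
PER_HOUR = GROUP IP_COUNT BY (dayNumber, hourNumber);

-- Sort and cut inside each group rather than over the whole relation
TOP_IPS_PER_HOUR = FOREACH PER_HOUR {
    sorted = ORDER IP_COUNT BY numEntries DESC;
    top_n  = LIMIT sorted 10;   -- N = 10, arbitrary
    GENERATE FLATTEN(top_n);    -- dayNumber, hourNumber, ip, numEntries
};

DUMP TOP_IPS_PER_HOUR;

If that worked, each output row would carry both the IP and its hit count
for the hour, which is what I'm after.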

Any help and pointers appreciated!


-P
