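-- $p and $p2 are parallelism values supplied via parameter substitution.
-- Step 1: one row per (hour, client), with that client's request count.
by_hour_client =
    foreach
        (group logs by (hour, client) parallel $p)
    generate
        flatten(group) as (hour, client),
        COUNT(logs) as num_reqs;

-- Step 2: roll up per hour; each row in the bag is one distinct client.
by_hour =
    foreach
        (group by_hour_client by hour parallel $p2)
    generate
        group as hour,
        COUNT(by_hour_client) as num_dist_clients,
        SUM(by_hour_client.num_reqs) as total_requests;
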
You can also do this with a nested distinct, but depending on what your
data looks like it might be a bad idea: it can put a lot of pressure on
individual reducers, which have to do the inner distinct in memory (although
Pig does push part of that work up to the mappers):

by_hour =
    foreach (group logs by hour) {
        dist_clients = distinct logs.client;
        generate
            group as hour,
            COUNT(dist_clients) as num_dist_clients,
            COUNT(logs) as total_requests;
    }
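
Applied to the relation from your script, the two-step version might look
roughly like the following. This is an untested sketch: it assumes LOGS_BASE
is defined exactly as in your message, leaves out the parallel clauses, and
the alias by_hour_sorted is just a name picked for illustration.

-- Assumes LOGS_BASE (with hour and client fields) is already defined.
by_hour_client =
    foreach
        (group LOGS_BASE by (hour, client))
    generate
        flatten(group) as (hour, client),
        COUNT(LOGS_BASE) as num_reqs;

by_hour =
    foreach
        (group by_hour_client by hour)
    generate
        group as hour,
        COUNT(by_hour_client) as num_dist_clients,
        SUM(by_hour_client.num_reqs) as total_requests;

-- by_hour_sorted is a hypothetical alias; it mirrors your "ORDER B BY hour".
by_hour_sorted = order by_hour by hour;
dump by_hour_sorted;

Here num_dist_clients would be your visits per hour (counting one visit per
distinct client) and total_requests your requests per hour.
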
D
On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
> I'm analyzing a daily apache log file. I'd like to get the number of
> requests and of visits by hour.
>
> I managed to get the requests, but how do I get the visits?
>
> grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> FLATTEN(
> REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> ) AS (
> client: chararray,
> username: chararray,
> date: chararray,
> hour: chararray,
> minute: chararray,
> second: chararray,
> timeZone: chararray,
> request: chararray,
> statusCode: int,
> bytesSent: chararray,
> referer: chararray,
> userAgent: chararray,
> remoteUser: chararray,
> timeTaken: chararray
> );
> grunt> A = GROUP LOGS_BASE BY hour;
> DESCRIBE A;
> A: {group: chararray,LOGS_BASE: {(client: chararray,username:
> chararray,date: chararray,hour: chararray,minute: chararray,second:
> chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent:
> chararray,referer: chararray,userAgent: chararray,remoteUser:
> chararray,timeTaken: chararray)}}
> grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> grunt> C = ORDER B BY hour; -- requests by hour
>
> How can I now get the distinct count of clients per hour?
>
> Thanks for your help!
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
>
> ********************************************************************************
>