Thanks Dmitriy,
The second method took less than 26 seconds on my computer (~550,000 lines).
The first method is giving me the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 34, column 7> Invalid field projection. Projected field [num_reqs]
does not exist in schema:
group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
when I try to define by_hour (after having defined by_hour_client):
grunt> by_hour_client =
>> foreach
>> (group logs by (hour, client))
>> generate
>> flatten(group) as (hour, client),
>> COUNT(logs) as num_reqs;
grunt> by_hour =
>> foreach
>> (group by_hour_client by hour)
>> generate
>> group as hour,
>> COUNT(by_hour_client) as num_dist_clients,
>> SUM(num_reqs) as total_requests;
If I understood correctly, that's because num_reqs now sits inside the
by_hour_client bag, as a result of
    (group by_hour_client by hour)
correct? So I changed the last line to
    SUM(by_hour_client.num_reqs) as total_requests;
and it worked (it took a little more than 29 seconds).
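
For reference, here is a sketch of the two statements with that one-line
change applied. It assumes the input relation is called logs, as in
Dmitriy's example, and leaves out the parallel clauses; the schema in the
comment is the one reported in the error message above.

```pig
-- Step 1: one row per (hour, client), with that client's request count.
by_hour_client =
    foreach (group logs by (hour, client))
    generate
        flatten(group) as (hour, client),
        COUNT(logs) as num_reqs;

-- Step 2: regroup by hour. After this group, each row has the schema
--   group:chararray, by_hour_client:bag{:tuple(hour, client, num_reqs)}
-- so num_reqs is no longer a top-level field and has to be reached
-- through the bag, as by_hour_client.num_reqs.
by_hour =
    foreach (group by_hour_client by hour)
    generate
        group as hour,
        COUNT(by_hour_client) as num_dist_clients,
        SUM(by_hour_client.num_reqs) as total_requests;
```

Assigning the grouped relation to an alias and running describe on it before
writing the generate clause shows which fields are top-level and which sit
inside the bag.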
Thanks for your help,
David
On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
> by_hour_client =
> foreach
> (group logs by (hour, client) parallel $p)
> generate
> flatten(group) as (hour, client),
> COUNT(logs) as num_reqs;
>
> by_hour =
> foreach
> (group by_hour_client by hour parallel $p2)
> generate
> group as hour,
> COUNT(by_hour_client) as num_dist_clients,
> SUM(num_reqs) as total_requests;
>
> You can also do this using a nested distinct, but depending on what your
> data looks like, it might be a bad idea, as it can put a lot of pressure on
> individual reducers that have to do the inner distinct in memory (although
> they do push part of this up to the mappers):
>
> by_hour =
> foreach (group logs by hour) {
> dist_clients = distinct logs.client;
> generate
> group as hour,
> COUNT(dist_clients) as num_dist_clients,
> COUNT(logs) as total_requests;
> }
>
> D
>
> On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>
> > I'm analyzing a daily apache log file. I'd like to get the number of
> > requests and of visits by hour.
> >
> > I managed to get the requests, but how do I get the visits?
> >
> > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> > FLATTEN(
> > REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> > ) AS (
> > client: chararray,
> > username: chararray,
> > date: chararray,
> > hour: chararray,
> > minute: chararray,
> > second: chararray,
> > timeZone: chararray,
> > request: chararray,
> > statusCode: int,
> > bytesSent: chararray,
> > referer: chararray,
> > userAgent: chararray,
> > remoteUser: chararray,
> > timeTaken: chararray
> > );
> > grunt> A = GROUP LOGS_BASE BY hour;
> > DESCRIBE A;
> > A: {group: chararray,LOGS_BASE: {(client: chararray,username:
> > chararray,date: chararray,hour: chararray,minute: chararray,second:
> > chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent:
> > chararray,referer: chararray,userAgent: chararray,remoteUser:
> > chararray,timeTaken: chararray)}}
> > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> > grunt> C = ORDER B BY hour; -- requests by hour
> >
> > How can I now get the distinct count of clients per hour?
> >
> > Thanks for your help!
> >
> > --
> > David Riccitelli
> >
> >
> >
> ********************************************************************************
> > InsideOut10 s.r.l.
> > P.IVA: IT-11381771002
> > Fax: +39 0110708239
> > ---
> > LinkedIn: http://it.linkedin.com/in/riccitelli
> > Twitter: ziodave
> > ---
> > Layar Partner Network<
> >
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> > >
> >
> >
> ********************************************************************************
>
--
David Riccitelli