Thanks Dmitriy,

The second method took less than 26 seconds on my computer (~550,000 lines).
The first method gives me the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 34, column 7> Invalid field projection. Projected field [num_reqs]
does not exist in schema:
group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.

when I try to define by_hour (after having defined by_hour_client):

grunt> by_hour_client =
>>  foreach
>>    (group logs by (hour, client))
>>  generate
>>    flatten(group) as (hour, client),
>>    COUNT(logs) as num_reqs;
grunt> by_hour =
>>  foreach
>>    (group by_hour_client by hour)
>>  generate
>>    group as hour,
>>    COUNT(by_hour_client) as num_dist_clients,
>>    SUM(num_reqs) as total_requests;

If I understood correctly, that's because num_reqs is inside the bag
produced by
    (group by_hour_client by hour)
correct? So I changed the last line to
    SUM(by_hour_client.num_reqs) as total_requests;
and it worked (it took a little more than 29 seconds).
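
For the archive, here is a sketch of the second statement with that change
applied. It assumes by_hour_client is defined exactly as above and leaves out
the parallel clause; I have only run it against my own log file. The bag that
the group produces is named after the grouped relation, so its fields have to
be projected through by_hour_client:

```pig
by_hour =
  foreach
    (group by_hour_client by hour)
  generate
    group as hour,
    COUNT(by_hour_client) as num_dist_clients,
    -- num_reqs is a field of the tuples inside the by_hour_client bag,
    -- so it is projected from the bag rather than referenced directly
    SUM(by_hour_client.num_reqs) as total_requests;
```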

Thanks for your help,
David


On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:

> by_hour_client =
>  foreach
>    (group logs by (hour, client) parallel $p)
>  generate
>    flatten(group) as (hour, client),
>    COUNT(logs) as num_reqs;
>
> by_hour =
>  foreach
>    (group by_hour_client by hour parallel $p2)
>  generate
>    group as hour,
>    COUNT(by_hour_client) as num_dist_clients,
>    SUM(num_reqs) as total_requests;
>
> You can also do this using a nested distinct, but depending on what your
> data looks like, it might be a bad idea, as it can put a lot of pressure on
> individual reducers that have to do the inner distinct in memory (although
> they do push part of this up to the mappers):
>
> by_hour =
>  foreach (group logs by hour) {
>   dist_clients = distinct logs.client;
>   generate
>    group as hour,
>    COUNT(dist_clients) as num_dist_clients,
>    COUNT(logs) as total_requests;
> }
>
> D
>
> On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>
> > I'm analyzing a daily apache log file. I'd like to get the number of
> > requests and of visits by hour.
> >
> > I managed to get the requests, but how do I get the visits?
> >
> > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> >  FLATTEN(
> >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> >  ) AS (
> >    client:   chararray,
> >    username: chararray,
> >    date: chararray,
> >    hour: chararray,
> >    minute: chararray,
> >    second: chararray,
> >    timeZone: chararray,
> >    request:  chararray,
> >    statusCode: int,
> >    bytesSent: chararray,
> >    referer:  chararray,
> >    userAgent: chararray,
> >    remoteUser: chararray,
> >    timeTaken: chararray
> > );
> > grunt> A = GROUP LOGS_BASE BY hour;
> > grunt> DESCRIBE A;
> > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
> > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> > grunt> C = ORDER B BY hour; -- requests by hour
> >
> > How can I now get the distinct count of clients per hour?
> >
> > Thanks for your help!
> >
> > --
> > David Riccitelli
> >
> >
> >
> ********************************************************************************
> > InsideOut10 s.r.l.
> > P.IVA: IT-11381771002
> > Fax: +39 0110708239
> > ---
> > LinkedIn: http://it.linkedin.com/in/riccitelli
> > Twitter: ziodave
> > ---
> > Layar Partner Network<
> >
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> > >
> >
> >
> ********************************************************************************
> >
>



-- 
David Riccitelli

