Right, that should read "by_hour_client.num_reqs".

Don't trust relative timings you get for small data on a single
computer in local mode. Things change when you start running on hundreds of
gigs with real skew on a cluster.
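
For reference, a sketch of the second statement with that fix applied, assuming the by_hour_client relation and field names from the script quoted below. After the group, num_reqs lives inside the by_hour_client bag, so SUM needs the column projected out of the bag, while COUNT can take the bag itself:

```pig
-- Assumes by_hour_client has one row per (hour, client) with a num_reqs count,
-- as produced by the first statement in the quoted script.
by_hour =
  foreach
    (group by_hour_client by hour)
  generate
    group as hour,
    -- one tuple per distinct client in this hour
    COUNT(by_hour_client) as num_dist_clients,
    -- project num_reqs out of the bag before summing
    SUM(by_hour_client.num_reqs) as total_requests;
```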

D

On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:

> Thanks Dmitriy,
>
> The second method took less than 26 seconds on my computer (~550,000 lines).
> The first method is giving me the following error:
>
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
> <line 34, column 7> Invalid field projection. Projected field [num_reqs]
> does not exist in schema:
>
> group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>
> when I try to set the by_hour (after having set the by_hour_client):
>
> grunt> by_hour_client =
> >>  foreach
> >>    (group logs by (hour, client))
> >>  generate
> >>    flatten(group) as (hour, client),
> >>    COUNT(logs) as num_reqs;
> grunt> by_hour =
> >>  foreach
> >>    (group by_hour_client by hour)
> >>  generate
> >>    group as hour,
> >>    COUNT(by_hour_client) as num_dist_clients,
> >>    SUM(num_reqs) as total_requests;
>
> If I understood correctly, that's because num_reqs is now inside the bag
> produced by
>     (group by_hour_client by hour)
> correct? So I changed the last line to
>     SUM(by_hour_client.num_reqs) as total_requests;
> and it worked (it took a little more than 29 seconds).
>
> Thanks for your help,
> David
>
>
> On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > by_hour_client =
> >  foreach
> >    (group logs by (hour, client) parallel $p)
> >  generate
> >    flatten(group) as (hour, client),
> >    COUNT(logs) as num_reqs;
> >
> > by_hour =
> >  foreach
> >    (group by_hour_client by hour parallel $p2)
> >  generate
> >    group as hour,
> >    COUNT(by_hour_client) as num_dist_clients,
> >    SUM(num_reqs) as total_requests;
> >
> > You can also do this using a nested distinct, but depending on what your
> > data looks like, it might be a bad idea, as it can put a lot of pressure on
> > individual reducers that have to do the inner distinct in memory (although
> > they do push part of this up to the mappers):
> >
> > by_hour =
> >  foreach (group logs by hour) {
> >   dist_clients = distinct logs.client;
> >   generate
> >    group as hour,
> >    COUNT(dist_clients) as num_dist_clients,
> >    COUNT(logs) as total_requests;
> > }
> >
> > D
> >
> > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
> >
> > > I'm analyzing a daily apache log file. I'd like to get the number of
> > > requests and of visits by hour.
> > >
> > > I managed to get the requests, but how do I get the visits?
> > >
> > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> > >  FLATTEN(
> > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> > >  ) AS (
> > >    client:   chararray,
> > >    username: chararray,
> > >    date: chararray,
> > >    hour: chararray,
> > >    minute: chararray,
> > >    second: chararray,
> > >    timeZone: chararray,
> > >    request:  chararray,
> > >    statusCode: int,
> > >    bytesSent: chararray,
> > >    referer:  chararray,
> > >    userAgent: chararray,
> > >    remoteUser: chararray,
> > >    timeTaken: chararray
> > > );
> > > grunt> A = GROUP LOGS_BASE BY hour;
> > > DESCRIBE A;
> > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
> > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> > > grunt> C = ORDER B BY hour; -- requests by hour
> > >
> > > How can I now get the distinct count of clients per hour?
> > >
> > > Thanks for your help!
> > >
> > > --
> > > David Riccitelli
> > >
> > >
> > > ********************************************************************************
> > > InsideOut10 s.r.l.
> > > P.IVA: IT-11381771002
> > > Fax: +39 0110708239
> > > ---
> > > LinkedIn: http://it.linkedin.com/in/riccitelli
> > > Twitter: ziodave
> > > ---
> > > Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> > > ********************************************************************************
> >
>
>
>
