I tried with another log file and that does not happen, so I suppose there's
some 'corrupted' line in the one I was testing.
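
For anyone who hits the same (,0,0) row: my guess, which I have not verified against the Pig source, is that a line that does not match the regex makes REGEX_EXTRACT_ALL return null, so every field of that record is null. Those records end up grouped under a null hour, and COUNT skips tuples whose first field is null, which would explain the zero counts. Below is a rough plain-Python model of that behaviour, not Pig itself. The shortened regex, the sample lines and the names parse and by_hour are made up for illustration.

```python
import re
from collections import defaultdict

# Shortened stand-in for the regex in the thread: client, username, [date:hh:mm:ss tz]
LINE_RE = re.compile(
    r'(\S+) (\S+) \[(\d{2}/\w{3}/\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-]\d{4})\]'
)

def parse(line):
    """Return (hour, client), or (None, None) for a line that does not match.
    The None pair models the all-null record assumed for unmatched lines."""
    m = LINE_RE.match(line)
    return (m.group(4), m.group(1)) if m else (None, None)

def by_hour(records):
    """Group by hour, then count distinct clients and requests per hour.
    None clients are skipped, modelling COUNT ignoring null first fields."""
    groups = defaultdict(list)
    for hour, client in records:
        groups[hour].append(client)
    rows = []
    for hour, clients in groups.items():
        non_null = [c for c in clients if c is not None]
        rows.append((hour, len(set(non_null)), len(non_null)))
    # Sort by hour, putting the None group last
    return sorted(rows, key=lambda r: (r[0] is None, r[0] or ""))

# Hypothetical sample: three well-formed lines and one corrupted line
SAMPLE = [
    '10.0.0.1 - [19/Aug/2011:00:01:02 +0200] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - [19/Aug/2011:00:05:09 +0200] "GET /a HTTP/1.1" 200 100',
    '10.0.0.1 - [19/Aug/2011:01:00:00 +0200] "GET /b HTTP/1.1" 404 0',
    'garbage line with no timestamp',
]

records = [parse(line) for line in SAMPLE]
with_corrupted = by_hour(records)
# Dropping records whose hour is None removes the stray row
filtered = by_hour([r for r in records if r[0] is not None])

print(with_corrupted)  # [('00', 2, 2), ('01', 1, 1), (None, 0, 0)]
print(filtered)        # [('00', 2, 2), ('01', 1, 1)]
```

If that guess is right, the Pig-side fix would presumably be a FILTER on "hour IS NOT NULL" before the grouping, so the unmatched lines never reach the GROUP.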

On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:

> There's something strange in the results, however:
> (00,129,30096)
> (01,91,16487)
> (02,57,11686)
> (03,41,6041)
> (04,30,4882)
> (05,33,4154)
> (06,65,8031)
> (07,66,12260)
> (08,95,17924)
> (09,131,21187)
> (10,162,26607)
> (11,155,28503)
> (12,146,27863)
> (13,152,29130)
> (14,159,32784)
> (15,150,28898)
> (16,143,28973)
> (17,169,29024)
> (18,199,26585)
> (19,182,28803)
> (20,224,32511)
> (21,232,38584)
> (22,225,39924)
> (23,191,33606)
> (,0,0)
>
>
> What is the last line?
>  (,0,0)
> Its counts are zero, so it shouldn't really be there, correct?
>
> (Using pig 0.9.0)
>
>
> Thanks,
> David
>
> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Right, that should read "by_hour_client.num_reqs".
>>
>> Don't trust relative measurements you get for small data on a single
>> computer in local mode. Things change when you start running on hundreds
>> of gigs with real skew on a cluster.
>>
>> D
>>
>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>
>> > Thanks Dmitriy,
>> >
>> > The second method took less than 26 seconds on my computer (~550,000 lines).
>> > The first method gives me the following error:
>> >
>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>> > does not exist in schema:
>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>> >
>> > when I try to set the by_hour (after having set the by_hour_client):
>> >
>> > grunt> by_hour_client =
>> > >>  foreach
>> > >>    (group logs by (hour, client))
>> > >>  generate
>> > >>    flatten(group) as (hour, client),
>> > >>    COUNT(logs) as num_reqs;
>> > grunt> by_hour =
>> > >>  foreach
>> > >>    (group by_hour_client by hour)
>> > >>  generate
>> > >>    group as hour,
>> > >>    COUNT(by_hour_client) as num_dist_clients,
>> > >>    SUM(num_reqs) as total_requests;
>> >
>> > If I understood correctly, that's because num_reqs is inside the bag that
>> > results from
>> > *    (group by_hour_client by hour)*
>> > correct? So I changed the last line to
>> >    *SUM(by_hour_client.num_reqs) as total_requests;*
>> > and it worked (it took a little more than 29 seconds).
>> >
>> > Thanks for your help,
>> > David
>> >
>> >
>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> >
>> > > by_hour_client =
>> > >  foreach
>> > >    (group logs by (hour, client) parallel $p)
>> > >  generate
>> > >    flatten(group) as (hour, client),
>> > >    COUNT(logs) as num_reqs;
>> > >
>> > > by_hour =
>> > >  foreach
>> > >    (group by_hour_client by hour parallel $p2)
>> > >  generate
>> > >    group as hour,
>> > >    COUNT(by_hour_client) as num_dist_clients,
>> > >    SUM(num_reqs) as total_requests;
>> > >
>> > > You can also do this using a nested distinct, but depending on what your
>> > > data looks like, it might be a bad idea, as it can put a lot of pressure
>> > > on individual reducers that have to do the inner distinct in memory
>> > > (although they do push part of this up to the mappers):
>> > >
>> > > by_hour =
>> > >  foreach (group logs by hour) {
>> > >   dist_clients = distinct logs.client;
>> > >   generate
>> > >    group as hour,
>> > >    COUNT(dist_clients) as num_dist_clients,
>> > >    COUNT(logs) as total_requests;
>> > > }
>> > >
>> > > D
>> > >
>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>> > >
>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>> > > > requests and of visits by hour.
>> > > >
>> > > > I managed to get the requests, but how do I get the visits?
>> > > >
>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>> > > >  FLATTEN(
>> > > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>> > > >  ) AS (
>> > > >    client:   chararray,
>> > > >    username: chararray,
>> > > >    date: chararray,
>> > > >    hour: chararray,
>> > > >    minute: chararray,
>> > > >    second: chararray,
>> > > >    timeZone: chararray,
>> > > >    request:  chararray,
>> > > >    statusCode: int,
>> > > >    bytesSent: chararray,
>> > > >    referer:  chararray,
>> > > >    userAgent: chararray,
>> > > >    remoteUser: chararray,
>> > > >    timeTaken: chararray
>> > > > );
>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>> > > > grunt> DESCRIBE A;
>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>> > > >
>> > > > How can I now get the distinct count of clients per hour?
>> > > >
>> > > > Thanks for your help!
>> > > >
>> > > > --
>> > > > David Riccitelli
>> > > >
>> > > > ********************************************************************************
>> > > > InsideOut10 s.r.l.
>> > > > P.IVA: IT-11381771002
>> > > > Fax: +39 0110708239
>> > > > ---
>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>> > > > Twitter: ziodave
>> > > > ---
>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> > > > ********************************************************************************
>> > >
>> >
>> >
>>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
