I still can't get this to work. I'm now trying to get the maximum time
taken, so as a test I run:
grunt> A = GROUP logs BY client;

then (timeTaken is long):
B = FOREACH A GENERATE group, MAX( logs.timeTaken );

when I dump it, I get the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
(...)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
    at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)

Initially I thought that some timeTaken values were not compatible with the
long data type, but I checked and re-checked. I also extract timeTaken with
a \d+ regular expression.

What am I doing wrong?
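
One possible cause, offered as a sketch rather than a confirmed diagnosis:
REGEX_EXTRACT_ALL produces chararray values, so a type declared only in the
AS clause of the extraction step may not actually convert timeTaken, and MAX
then receives strings. The sketch below assumes that logs was built from the
extraction step quoted further down, with timeTaken still declared as
chararray there. The relation names typed, by_client and max_time are made
up for illustration.

```pig
-- Assumption: logs has the schema from the extraction step, with
-- timeTaken declared as chararray (not long) in the AS clause.
typed = FOREACH logs GENERATE client, (long)timeTaken AS timeTaken;

-- Group and aggregate on the explicitly cast field.
by_client = GROUP typed BY client;
max_time = FOREACH by_client GENERATE
    group AS client,
    MAX(typed.timeTaken) AS max_time_taken;
DUMP max_time;
```

With an explicit cast, values that cannot be parsed as long should become
null instead of raising a ClassCastException, and MAX skips nulls.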

Thanks!
David

On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]>wrote:

> I tried with another log file and that does not happen, so I suppose
> there's some 'corrupted' line in the one I was testing.
>
>
> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]>wrote:
>
>> There's something strange in the results, however:
>> (00,129,30096)
>> (01,91,16487)
>> (02,57,11686)
>> (03,41,6041)
>> (04,30,4882)
>> (05,33,4154)
>> (06,65,8031)
>> (07,66,12260)
>> (08,95,17924)
>> (09,131,21187)
>> (10,162,26607)
>> (11,155,28503)
>> (12,146,27863)
>> (13,152,29130)
>> (14,159,32784)
>> (15,150,28898)
>> (16,143,28973)
>> (17,169,29024)
>> (18,199,26585)
>> (19,182,28803)
>> (20,224,32511)
>> (21,232,38584)
>> (22,225,39924)
>> (23,191,33606)
>> (,0,0)
>>
>>
>> What is the last line?
>>  (,0,0)
>> The count is zero, so it shouldn't really be there, correct?
>>
>> (Using pig 0.9.0)
>>
>>
>> Thanks,
>> David
>>
>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]>wrote:
>>
>>> Right, that should read "by_hour_client.num_reqs".
>>>
>>> Don't trust relative measurements you get for small data on a single
>>> computer in local mode. Things change when you start running on hundreds
>>> of gigs with real skew on a cluster.
>>>
>>> D
>>>
>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>
>>> > Thanks Dmitriy,
>>> >
>>> > The second method took less than 26 seconds on my computer (~550,000
>>> > lines). The first method gives me the following error:
>>> >
>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>> > does not exist in schema:
>>> >
>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>> >
>>> > when I try to define by_hour (after having defined by_hour_client):
>>> >
>>> > grunt> by_hour_client =
>>> > >>  foreach
>>> > >>    (group logs by (hour, client))
>>> > >>  generate
>>> > >>    flatten(group) as (hour, client),
>>> > >>    COUNT(logs) as num_reqs;
>>> > grunt> by_hour =
>>> > >>  foreach
>>> > >>    (group by_hour_client by hour)
>>> > >>  generate
>>> > >>    group as hour,
>>> > >>    COUNT(by_hour_client) as num_dist_clients,
>>> > >>    SUM(num_reqs) as total_requests;
>>> >
>>> > If I understood correctly, that's because num_reqs is inside the bag
>>> > produced by
>>> >     (group by_hour_client by hour)
>>> > correct? So I changed the last line to
>>> >     SUM(by_hour_client.num_reqs) as total_requests;
>>> > and it worked (it took a little more than 29 seconds).
>>> >
>>> > Thanks for your help,
>>> > David
>>> >
>>> >
>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>> >
>>> > > by_hour_client =
>>> > >  foreach
>>> > >    (group logs by (hour, client) parallel $p)
>>> > >  generate
>>> > >    flatten(group) as (hour, client),
>>> > >    COUNT(logs) as num_reqs;
>>> > >
>>> > > by_hour =
>>> > >  foreach
>>> > >    (group by_hour_client by hour parallel $p2)
>>> > >  generate
>>> > >    group as hour,
>>> > >    COUNT(by_hour_client) as num_dist_clients,
>>> > >    SUM(num_reqs) as total_requests;
>>> > >
>>> > > You can also do this using a nested distinct, but depending on what your
>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure
>>> > > on individual reducers that have to do the inner distinct in memory
>>> > > (although they do push part of this up to the mappers):
>>> > >
>>> > > by_hour =
>>> > >  foreach (group logs by hour) {
>>> > >   dist_clients = distinct logs.client;
>>> > >   generate
>>> > >    group as hour,
>>> > >    COUNT(dist_clients) as num_dist_clients,
>>> > >    COUNT(logs) as total_requests;
>>> > > }
>>> > >
>>> > > D
>>> > >
>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>> > >
>>> > > > I'm analyzing a daily Apache log file. I'd like to get the number of
>>> > > > requests and of visits by hour.
>>> > > >
>>> > > > I managed to get the requests, but how do I get the visits?
>>> > > >
>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>> > > >  FLATTEN(
>>> > > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>> > > >  ) AS (
>>> > > >    client:   chararray,
>>> > > >    username: chararray,
>>> > > >    date: chararray,
>>> > > >    hour: chararray,
>>> > > >    minute: chararray,
>>> > > >    second: chararray,
>>> > > >    timeZone: chararray,
>>> > > >    request:  chararray,
>>> > > >    statusCode: int,
>>> > > >    bytesSent: chararray,
>>> > > >    referer:  chararray,
>>> > > >    userAgent: chararray,
>>> > > >    remoteUser: chararray,
>>> > > >    timeTaken: chararray
>>> > > > );
>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>> > > > grunt> DESCRIBE A;
>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>> > > >
>>> > > > How can I now get the distinct count of clients per hour?
>>> > > >
>>> > > > Thanks for your help!
>>> > > >
>>> > > > --
>>> > > > David Riccitelli
>>> > > >
>>> > > > ********************************************************************************
>>> > > > InsideOut10 s.r.l.
>>> > > > P.IVA: IT-11381771002
>>> > > > Fax: +39 0110708239
>>> > > > ---
>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>> > > > Twitter: ziodave
>>> > > > ---
>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>> > > > ********************************************************************************


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner 
Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
