I noticed that this issue arises only when I load the initial data with
TextLoader() and parse it with REGEX_EXTRACT_ALL.

If I use PigStorage instead (splitting on spaces, without any regular
expression, i.e. without REGEX_EXTRACT_ALL), it works.

But I need REGEX_EXTRACT_ALL to parse the lines correctly.

Does it make sense?
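
One possible explanation, offered only as an assumption and not verified
against the Pig source: REGEX_EXTRACT_ALL returns a tuple of chararray
values, and a type written in the AS clause after FLATTEN may be treated as
a declaration rather than a conversion. In that case timeTaken would still
be a String at runtime when MAX expects a Long, while PigStorage converts
the fields itself when a typed schema is given. A minimal sketch of a
workaround, assuming the parsed relation is called logs, that timeTaken is
declared as chararray in the AS clause, and using the hypothetical relation
names matched, typed, by_client and max_time:

```pig
-- Lines that the regular expression did not match produce null fields;
-- drop them before casting.
matched = FILTER logs BY timeTaken IS NOT NULL;

-- Cast explicitly instead of relying on the type named in the AS clause.
typed = FOREACH matched GENERATE
    client,
    (long) timeTaken AS timeTaken;

by_client = GROUP typed BY client;

max_time = FOREACH by_client GENERATE
    group AS client,
    MAX(typed.timeTaken) AS maxTimeTaken;

DUMP max_time;
```

If the explicit cast makes the ClassCastException go away, the data itself
was probably fine and only the declared type was not being applied.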

David

On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]> wrote:

> I still can't accomplish what I'm after. I'm now trying to get the maximum
> time taken, so as a test I run:
> grunt> A = GROUP logs BY client;
>
> then (timeTaken is long):
> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>
> when I dump it, I get the following error:
> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> (...)
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>
> Initially I thought some timeTaken values were not compatible with the long
> data type, but I checked and re-checked. I also capture timeTaken with a
> \d+ regular expression.
>
> What am I doing wrong?
>
> Thanks!
> David
>
> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]> wrote:
>
>> I tried with another log file and that does not happen, so I suppose
>> there's some 'corrupted' line in the one I was testing.
>>
>>
>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:
>>
>>> There's something strange in the results, however:
>>> (00,129,30096)
>>> (01,91,16487)
>>> (02,57,11686)
>>> (03,41,6041)
>>> (04,30,4882)
>>> (05,33,4154)
>>> (06,65,8031)
>>> (07,66,12260)
>>> (08,95,17924)
>>> (09,131,21187)
>>> (10,162,26607)
>>> (11,155,28503)
>>> (12,146,27863)
>>> (13,152,29130)
>>> (14,159,32784)
>>> (15,150,28898)
>>> (16,143,28973)
>>> (17,169,29024)
>>> (18,199,26585)
>>> (19,182,28803)
>>> (20,224,32511)
>>> (21,232,38584)
>>> (22,225,39924)
>>> (23,191,33606)
>>> (,0,0)
>>>
>>>
>>> What is the last line,
>>>  (,0,0)?
>>> Its count is zero, so it shouldn't really be there, correct?
>>>
>>> (Using pig 0.9.0)
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>
>>>> Right, that should read "by_hour_client.num_reqs".
>>>>
>>>> Don't trust relative measurements you get for small data on a single
>>>> computer in local mode. Things change when you start running on hundreds
>>>> of gigs with real skew on a cluster.
>>>>
>>>> D
>>>>
>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>>
>>>> > Thanks Dmitriy,
>>>> >
>>>> > The second method took less than 26 seconds on my computer (~550,000
>>>> > lines). The first method is giving me the following error:
>>>> >
>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>> > does not exist in schema:
>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>> >
>>>> > when I try to define by_hour (after having defined by_hour_client):
>>>> >
>>>> > grunt> by_hour_client =
>>>> > >>  foreach
>>>> > >>    (group logs by (hour, client))
>>>> > >>  generate
>>>> > >>    flatten(group) as (hour, client),
>>>> > >>    COUNT(logs) as num_reqs;
>>>> > grunt> by_hour =
>>>> > >>  foreach
>>>> > >>    (group by_hour_client by hour)
>>>> > >>  generate
>>>> > >>    group as hour,
>>>> > >>    COUNT(by_hour_client) as num_dist_clients,
>>>> > >>    SUM(num_reqs) as total_requests;
>>>> >
>>>> > If I understood correctly, that's because num_reqs is inside the bag
>>>> > produced by
>>>> >     (group by_hour_client by hour)
>>>> > correct? So I changed the last line to
>>>> >     SUM(by_hour_client.num_reqs) as total_requests;
>>>> > and it worked (it took a little more than 29 seconds).
>>>> >
>>>> > Thanks for your help,
>>>> > David
>>>> >
>>>> >
>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>> >
>>>> > > by_hour_client =
>>>> > >  foreach
>>>> > >    (group logs by (hour, client) parallel $p)
>>>> > >  generate
>>>> > >    flatten(group) as (hour, client),
>>>> > >    COUNT(logs) as num_reqs;
>>>> > >
>>>> > > by_hour =
>>>> > >  foreach
>>>> > >    (group by_hour_client by hour parallel $p2)
>>>> > >  generate
>>>> > >    group as hour,
>>>> > >    COUNT(by_hour_client) as num_dist_clients,
>>>> > >    SUM(num_reqs) as total_requests;
>>>> > >
>>>> > > You can also do this using a nested distinct, but depending on what your
>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>> > > they do push part of this up to the mappers):
>>>> > >
>>>> > > by_hour =
>>>> > >  foreach (group logs by hour) {
>>>> > >   dist_clients = distinct logs.client;
>>>> > >   generate
>>>> > >    group as hour,
>>>> > >    COUNT(dist_clients) as num_dist_clients,
>>>> > >    COUNT(logs) as total_requests;
>>>> > > }
>>>> > >
>>>> > > D
>>>> > >
>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>> > >
>>>> > > > I'm analyzing a daily Apache log file. I'd like to get the number of
>>>> > > > requests and of visits by hour.
>>>> > > >
>>>> > > > I managed to get the requests, but how do I get the visits?
>>>> > > >
>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>> > > >  FLATTEN(
>>>> > > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>> > > >  ) AS (
>>>> > > >    client:   chararray,
>>>> > > >    username: chararray,
>>>> > > >    date: chararray,
>>>> > > >    hour: chararray,
>>>> > > >    minute: chararray,
>>>> > > >    second: chararray,
>>>> > > >    timeZone: chararray,
>>>> > > >    request:  chararray,
>>>> > > >    statusCode: int,
>>>> > > >    bytesSent: chararray,
>>>> > > >    referer:  chararray,
>>>> > > >    userAgent: chararray,
>>>> > > >    remoteUser: chararray,
>>>> > > >    timeTaken: chararray
>>>> > > > );
>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>> > > > grunt> DESCRIBE A;
>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>> > > >
>>>> > > > How can I now get the distinct count of clients per hour?
>>>> > > >
>>>> > > > Thanks for your help!
>>>> > > >
>>>> > > > --
>>>> > > > David Riccitelli
>>>> > > >
>>>> > > > ********************************************************************************
>>>> > > > InsideOut10 s.r.l.
>>>> > > > P.IVA: IT-11381771002
>>>> > > > Fax: +39 0110708239
>>>> > > > ---
>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> > > > Twitter: ziodave
>>>> > > > ---
>>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>> > > > ********************************************************************************
>>>>


