I tried changing this line, from:
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
USING TextLoader() AS (line:chararray);

to:
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
USING PigStorage() AS (line:chararray);

It does not fix the issue, since the problem depends on REGEX_EXTRACT_ALL, which
produces the logs schema.

Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
function?
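
A possible explanation, offered as an assumption rather than a confirmed
diagnosis: the ClassCastException quoted below says a String reached LongMax,
which suggests that declaring timeTaken as long in the AS clause only relabels
the strings REGEX_EXTRACT_ALL returns instead of converting them. A minimal
sketch of a workaround would declare the captures as chararray and cast
explicitly before calling MAX. The two-group pattern and the relation names
PARSED and MATCHED are hypothetical, shortened for illustration; the real
script would keep the full pattern.

```pig
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
    USING TextLoader() AS (line:chararray);

-- Hypothetical shortened pattern: first token is the client, last token is
-- the time taken. Every capture is declared as chararray, matching what the
-- regex function actually hands back.
PARSED = FOREACH RAW_LOGS GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '^(\\S+) .* (\\d+)$'))
    AS (client:chararray, timeTaken:chararray);

-- Assumption: lines the pattern does not match come back as null fields,
-- so drop them before aggregating.
MATCHED = FILTER PARSED BY client IS NOT NULL;

-- Explicit cast from chararray to long, instead of relying on the AS clause.
logs = FOREACH MATCHED GENERATE client, (long) timeTaken AS timeTaken;

A = GROUP logs BY client;
B = FOREACH A GENERATE group AS client, MAX(logs.timeTaken) AS max_time_taken;
DUMP B;
```

If the assumption holds, MAX receives real long values because the cast, not
the AS clause, does the conversion.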

Thanks for your help,
David

On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <[email protected]>wrote:

> I noticed that this issue arises only if I load the initial data with
> TextLoader() and parse it with REGEX_EXTRACT_ALL.
>
> If I use PigStorage (splitting on spaces, without a regular expression,
> i.e. w/o REGEX_EXTRACT_ALL), it works.
>
> But I need the REGEX_EXTRACT_ALL in order to correctly parse the lines...
>
> Does it make sense?
>
> David
>
>
> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]>wrote:
>
>> I still can't manage to accomplish my objectives. I'm now trying to get
>> the max time taken so, as a test, I do:
>> grunt> A = GROUP logs BY client;
>>
>> then (timeTaken is long):
>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>
>> when I dump it, I get the following error:
>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error
>> while computing max in Initial
>> (...)
>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast
>> to java.lang.Long
>> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>
>> Initially I thought that some timeTaken values were not compatible with the
>> long data type, but I checked and re-checked. I also capture timeTaken with
>> a \d+ regular expression.
>>
>> What am I doing wrong?
>>
>> Thanks!
>> David
>>
>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]>wrote:
>>
>>> I tried with another log file and that does not happen, so I suppose
>>> there's some 'corrupted' line in the one I was testing.
>>>
>>>
>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]>wrote:
>>>
>>>> There's something strange in the results however:
>>>> (00,129,30096)
>>>> (01,91,16487)
>>>> (02,57,11686)
>>>> (03,41,6041)
>>>> (04,30,4882)
>>>> (05,33,4154)
>>>> (06,65,8031)
>>>> (07,66,12260)
>>>> (08,95,17924)
>>>> (09,131,21187)
>>>> (10,162,26607)
>>>> (11,155,28503)
>>>> (12,146,27863)
>>>> (13,152,29130)
>>>> (14,159,32784)
>>>> (15,150,28898)
>>>> (16,143,28973)
>>>> (17,169,29024)
>>>> (18,199,26585)
>>>> (19,182,28803)
>>>> (20,224,32511)
>>>> (21,232,38584)
>>>> (22,225,39924)
>>>> (23,191,33606)
>>>> (,0,0)
>>>>
>>>>
>>>> What is the last line?
>>>>  (,0,0)
>>>> The count is zero, so it shouldn't really be there, correct?
>>>>
>>>> (Using pig 0.9.0)
>>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]>wrote:
>>>>
>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>
>>>>> Don't trust relative measurements you get for small data on a single
>>>>> computer in local mode. Things change when you start running on
>>>>> hundreds of gigs with real skew on a cluster.
>>>>>
>>>>> D
>>>>>
>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>>>
>>>>> > Thanks Dmitriy,
>>>>> >
>>>>> > The second method took less than 26 secs. on my computer (~550,000 lines).
>>>>> > The first method is giving me the following error:
>>>>> >
>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>> > does not exist in schema:
>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>> >
>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>> >
>>>>> > grunt> by_hour_client =
>>>>> > >>  foreach
>>>>> > >>    (group logs by (hour, client))
>>>>> > >>  generate
>>>>> > >>    flatten(group) as (hour, client),
>>>>> > >>    COUNT(logs) as num_reqs;
>>>>> > grunt> by_hour =
>>>>> > >>  foreach
>>>>> > >>    (group by_hour_client by hour)
>>>>> > >>  generate
>>>>> > >>    group as hour,
>>>>> > >>    COUNT(by_hour_client) as num_dist_clients,
>>>>> > >>    SUM(num_reqs) as total_requests;
>>>>> >
>>>>> > If I understood correctly, that's because num_reqs is inside the bag
>>>>> > produced by
>>>>> > *    (group by_hour_client by hour)*
>>>>> > correct? So I changed the last line to
>>>>> >    *SUM(by_hour_client.num_reqs) as total_requests;*
>>>>> > and it worked (it took a little more than 29 seconds).
>>>>> >
>>>>> > Thanks for your help,
>>>>> > David
>>>>> >
>>>>> >
>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]>
>>>>> > wrote:
>>>>> >
>>>>> > > by_hour_client =
>>>>> > >  foreach
>>>>> > >    (group logs by (hour, client) parallel $p)
>>>>> > >  generate
>>>>> > >    flatten(group) as (hour, client),
>>>>> > >    COUNT(logs) as num_reqs;
>>>>> > >
>>>>> > > by_hour =
>>>>> > >  foreach
>>>>> > >    (group by_hour_client by hour parallel $p2)
>>>>> > >  generate
>>>>> > >    group as hour,
>>>>> > >    COUNT(by_hour_client) as num_dist_clients,
>>>>> > >    SUM(num_reqs) as total_requests;
>>>>> > >
>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>> > > they do push part of this up to the mappers):
>>>>> > >
>>>>> > > by_hour =
>>>>> > >  foreach (group logs by hour) {
>>>>> > >   dist_clients = distinct logs.client;
>>>>> > >   generate
>>>>> > >    group as hour,
>>>>> > >    COUNT(dist_clients) as num_dist_clients,
>>>>> > >    COUNT(logs) as total_requests;
>>>>> > > }
>>>>> > >
>>>>> > > D
>>>>> > >
>>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>>> > >
>>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>>>> > > > requests and of visits by hour.
>>>>> > > >
>>>>> > > > I managed to get the requests, but how do I get the visits?
>>>>> > > >
>>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>>> > > >  FLATTEN(
>>>>> > > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>>> > > >  ) AS (
>>>>> > > >    client:   chararray,
>>>>> > > >    username: chararray,
>>>>> > > >    date: chararray,
>>>>> > > >    hour: chararray,
>>>>> > > >    minute: chararray,
>>>>> > > >    second: chararray,
>>>>> > > >    timeZone: chararray,
>>>>> > > >    request:  chararray,
>>>>> > > >    statusCode: int,
>>>>> > > >    bytesSent: chararray,
>>>>> > > >    referer:  chararray,
>>>>> > > >    userAgent: chararray,
>>>>> > > >    remoteUser: chararray,
>>>>> > > >    timeTaken: chararray
>>>>> > > > );
>>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>>> > > > DESCRIBE A;
>>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>>> > > >
>>>>> > > > How can I now get the distinct count of clients per hour?
>>>>> > > >
>>>>> > > > Thanks for your help!
>>>>> > > >
>>>>> > > > --
>>>>> > > > David Riccitelli
>>>>> > > >
>>>>> > > > ********************************************************************************
>>>>> > > > InsideOut10 s.r.l.
>>>>> > > > P.IVA: IT-11381771002
>>>>> > > > Fax: +39 0110708239
>>>>> > > > ---
>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> > > > Twitter: ziodave
>>>>> > > > ---
>>>>> > > > Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>> > > > ********************************************************************************
>>>
>>>
>>
>>
>
>


