OK, it seems I've been able to solve it by casting the bag to long, i.e.
*(bag{tuple(long)})logs.timeTaken*. I changed this:
by_hour =
  foreach (group logs by hour) {
    dist_clients = distinct logs.client;
    max_time_taken = logs.timeTaken;
    generate
      group as hour,
      COUNT(dist_clients) as num_dist_clients,
      COUNT(logs) as total_requests,
      MAX( max_time_taken );
  };
to this:
by_hour =
  foreach (group logs by hour) {
    dist_clients = distinct logs.client;
    max_time_taken = (bag{tuple(long)})logs.timeTaken;
    generate
      group as hour,
      COUNT(dist_clients) as num_dist_clients,
      COUNT(logs) as total_requests,
      MAX( max_time_taken );
  };
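
A possible alternative, which I have not verified on 0.9.0, is to cast the field once right after parsing instead of casting the bag inside every nested block. This is only a sketch: it assumes LOGS_BASE is the relation built from FLATTEN(REGEX_EXTRACT_ALL(...)) with timeTaken declared as chararray (as in my first message), and it assumes the explicit (long) cast really converts the captured strings, which the ClassCastException suggests the AS clause alone did not do.

```pig
-- Sketch only: LOGS_BASE is assumed to carry client, hour and timeTaken as chararray.
-- The explicit cast turns the regex capture into a long before any grouping.
logs = FOREACH LOGS_BASE GENERATE
    client,
    hour,
    (long) timeTaken AS timeTaken;

by_hour =
    FOREACH (GROUP logs BY hour) {
        dist_clients = DISTINCT logs.client;
        GENERATE
            group AS hour,
            COUNT(dist_clients) AS num_dist_clients,
            COUNT(logs) AS total_requests,
            MAX(logs.timeTaken) AS max_time_taken;
    };
```

If that works, MAX should see longs directly and the bag cast above would no longer be needed.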
David
On Fri, Aug 19, 2011 at 7:44 PM, David Riccitelli <[email protected]>wrote:
> Sorry for this long sequence of messages, but I'm posting things as I
> continue testing/investigating.
>
> Might this be relevant to my case?
>
> http://www.mail-archive.com/[email protected]/msg02258.html
>
> Thanks,
> David
>
>
> On Fri, Aug 19, 2011 at 7:31 PM, David Riccitelli <[email protected]>wrote:
>
>> I tried changing this line, from:
>> RAW_LOGS = LOAD
>> '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test' USING
>> TextLoader() AS (line:chararray);
>>
>> to:
>> RAW_LOGS = LOAD
>> '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test' USING
>> PigStorage() AS (line:chararray);
>>
>> It does not fix the issue, as it depends on the REGEX_EXTRACT_ALL
>> that produces the logs schema.
>>
>> Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
>> function?
>>
>> Thanks for your help,
>> David
>>
>> On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <[email protected]>wrote:
>>
>>> I noticed that this issue arises only if I load the initial data with
>>> TextLoader() and parse it with REGEX_EXTRACT_ALL.
>>>
>>> If I use PigStorage (splitting on spaces, without a regular expression,
>>> i.e. w/o REGEX_EXTRACT_ALL), it works.
>>>
>>> But I need the REGEX_EXTRACT_ALL in order to correctly parse the lines...
>>>
>>> Does it make sense?
>>>
>>> David
>>>
>>>
>>> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]>wrote:
>>>
>>>> I still can't get what I need. I'm now trying to get the max time
>>>> taken, so, as a test, I do:
>>>> grunt> A = GROUP logs BY client;
>>>>
>>>> then (timeTaken is long):
>>>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>>>
>>>> when I dump it, I get the following error:
>>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>>>> (...)
>>>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
>>>> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>>>
>>>> Initially I thought that some timeTaken values were not compatible with
>>>> the long data type, but I checked and re-checked. I also capture
>>>> timeTaken with a \d+ regular expression.
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thanks!
>>>> David
>>>>
>>>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli
>>>> <[email protected]>wrote:
>>>>
>>>>> I tried with another log file and that does not happen, so I suppose
>>>>> there's some 'corrupted' line in the one I was testing.
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> There's something strange in the results, however:
>>>>>> (00,129,30096)
>>>>>> (01,91,16487)
>>>>>> (02,57,11686)
>>>>>> (03,41,6041)
>>>>>> (04,30,4882)
>>>>>> (05,33,4154)
>>>>>> (06,65,8031)
>>>>>> (07,66,12260)
>>>>>> (08,95,17924)
>>>>>> (09,131,21187)
>>>>>> (10,162,26607)
>>>>>> (11,155,28503)
>>>>>> (12,146,27863)
>>>>>> (13,152,29130)
>>>>>> (14,159,32784)
>>>>>> (15,150,28898)
>>>>>> (16,143,28973)
>>>>>> (17,169,29024)
>>>>>> (18,199,26585)
>>>>>> (19,182,28803)
>>>>>> (20,224,32511)
>>>>>> (21,232,38584)
>>>>>> (22,225,39924)
>>>>>> (23,191,33606)
>>>>>> (,0,0)
>>>>>>
>>>>>>
>>>>>> What is the last line:
>>>>>> (,0,0)
>>>>>> The count is zero, so it shouldn't really be there, correct?
>>>>>>
>>>>>> (Using pig 0.9.0)
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy
>>>>>> <[email protected]>wrote:
>>>>>>
>>>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>>>
>>>>>>> Don't trust relative measurements you get for small data on a single
>>>>>>> computer in local mode. Things change when you start running on
>>>>>>> hundreds of gigs with real skew on a cluster.
>>>>>>>
>>>>>>> D
>>>>>>>
>>>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <
>>>>>>> [email protected]>wrote:
>>>>>>>
>>>>>>> > Thanks Dmitriy,
>>>>>>> >
>>>>>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>>>>>> > The first method is giving me the following error:
>>>>>>> >
>>>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>>>> > does not exist in schema:
>>>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>>>> >
>>>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>>>> >
>>>>>>> > grunt> by_hour_client =
>>>>>>> > >> foreach
>>>>>>> > >> (group logs by (hour, client))
>>>>>>> > >> generate
>>>>>>> > >> flatten(group) as (hour, client),
>>>>>>> > >> COUNT(logs) as num_reqs;
>>>>>>> > grunt> by_hour =
>>>>>>> > >> foreach
>>>>>>> > >> (group by_hour_client by hour)
>>>>>>> > >> generate
>>>>>>> > >> group as hour,
>>>>>>> > >> COUNT(by_hour_client) as num_dist_clients,
>>>>>>> > >> SUM(num_reqs) as total_requests;
>>>>>>> >
>>>>>>> > If I understood correctly, that's because num_reqs is inside the bag,
>>>>>>> > as a result of the
>>>>>>> > * (group by_hour_client by hour)*
>>>>>>> > correct? So I changed the last line to
>>>>>>> > *SUM(by_hour_client.num_reqs) as total_requests;*
>>>>>>> > and it worked (it took a little more than 29 seconds).
>>>>>>> >
>>>>>>> > Thanks for your help,
>>>>>>> > David
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>>>> >
>>>>>>> > > by_hour_client =
>>>>>>> > > foreach
>>>>>>> > > (group logs by (hour, client) parallel $p)
>>>>>>> > > generate
>>>>>>> > > flatten(group) as (hour, client),
>>>>>>> > > COUNT(logs) as num_reqs;
>>>>>>> > >
>>>>>>> > > by_hour =
>>>>>>> > > foreach
>>>>>>> > > (group by_hour_client by hour parallel $p2)
>>>>>>> > > generate
>>>>>>> > > group as hour,
>>>>>>> > > COUNT(by_hour_client) as num_dist_clients,
>>>>>>> > > SUM(num_reqs) as total_requests;
>>>>>>> > >
>>>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>>>> > > they do push part of this up to the mappers):
>>>>>>> > >
>>>>>>> > > by_hour =
>>>>>>> > > foreach (group logs by hour) {
>>>>>>> > > dist_clients = distinct logs.client;
>>>>>>> > > generate
>>>>>>> > > group as hour,
>>>>>>> > > COUNT(dist_clients) as num_dist_clients,
>>>>>>> > > COUNT(logs) as total_requests;
>>>>>>> > > }
>>>>>>> > >
>>>>>>> > > D
>>>>>>> > >
>>>>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>>>>> > >
>>>>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>>>>>> > > > requests and of visits by hour.
>>>>>>> > > >
>>>>>>> > > > I managed to get the requests, but how do I get the visits?
>>>>>>> > > >
>>>>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>>>>> > > > FLATTEN(
>>>>>>> > > > REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>>>>> > > > ) AS (
>>>>>>> > > > client: chararray,
>>>>>>> > > > username: chararray,
>>>>>>> > > > date: chararray,
>>>>>>> > > > hour: chararray,
>>>>>>> > > > minute: chararray,
>>>>>>> > > > second: chararray,
>>>>>>> > > > timeZone: chararray,
>>>>>>> > > > request: chararray,
>>>>>>> > > > statusCode: int,
>>>>>>> > > > bytesSent: chararray,
>>>>>>> > > > referer: chararray,
>>>>>>> > > > userAgent: chararray,
>>>>>>> > > > remoteUser: chararray,
>>>>>>> > > > timeTaken: chararray
>>>>>>> > > > );
>>>>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>>>>> > > > DESCRIBE A;
>>>>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>>>>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>>>>> > > >
>>>>>>> > > > How can I now get the distinct count of clients per hour?
>>>>>>> > > >
>>>>>>> > > > Thanks for your help!
>>>>>>> > > >
>>>>>>> > > > --
>>>>>>> > > > David Riccitelli
>>>>>>> > > >
>>>>>>> > > > ********************************************************************************
>>>>>>> > > > InsideOut10 s.r.l.
>>>>>>> > > > P.IVA: IT-11381771002
>>>>>>> > > > Fax: +39 0110708239
>>>>>>> > > > ---
>>>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>>> > > > Twitter: ziodave
>>>>>>> > > > ---
>>>>>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>>> > > > ********************************************************************************