I tried changing this line from:

    RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test' USING TextLoader() AS (line:chararray);

to:

    RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test' USING PigStorage() AS (line:chararray);

It does not fix the issue, since the issue depends on the REGEX_EXTRACT_ALL call that produces the logs schema.

Is there any incompatibility between REGEX_EXTRACT_ALL and the MAX function?

Thanks for your help,
David

On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <[email protected]> wrote:

> I noticed that this issue arises only if I load the initial data with
> TextLoader() and use REGEX_EXTRACT_ALL.
>
> If I use PigStorage (splitting on spaces, not using a regular expression,
> i.e. w/o REGEX_EXTRACT_ALL), it works.
>
> But I need REGEX_EXTRACT_ALL in order to parse the lines correctly...
>
> Does it make sense?
>
> David
>
> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]> wrote:
>
>> I still can't manage to accomplish my objectives. I'm now trying to get
>> the max time taken so, as a test, I do:
>> grunt> A = GROUP logs BY client;
>>
>> then (timeTaken is long):
>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>
>> when I dump it, I get the following error:
>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>> (...)
>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
>> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>
>> Initially I thought that some timeTaken values were not compatible with
>> the long data type, but I checked and re-checked. I also capture
>> timeTaken with the \d+ regular expression.
>>
>> What am I doing wrong?
>>
>> Thanks!
>> David
>>
>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]> wrote:
>>
>>> I tried with another log file and that does not happen, so I suppose
>>> there's some 'corrupted' line in the one I was testing.
>>>
>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:
>>>
>>>> There's something strange in the results, however:
>>>> (00,129,30096)
>>>> (01,91,16487)
>>>> (02,57,11686)
>>>> (03,41,6041)
>>>> (04,30,4882)
>>>> (05,33,4154)
>>>> (06,65,8031)
>>>> (07,66,12260)
>>>> (08,95,17924)
>>>> (09,131,21187)
>>>> (10,162,26607)
>>>> (11,155,28503)
>>>> (12,146,27863)
>>>> (13,152,29130)
>>>> (14,159,32784)
>>>> (15,150,28898)
>>>> (16,143,28973)
>>>> (17,169,29024)
>>>> (18,199,26585)
>>>> (19,182,28803)
>>>> (20,224,32511)
>>>> (21,232,38584)
>>>> (22,225,39924)
>>>> (23,191,33606)
>>>> (,0,0)
>>>>
>>>> What is the last line:
>>>> (,0,0)
>>>> The count is zero, so it shouldn't really be there, correct?
>>>>
>>>> (Using pig 0.9.0)
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>
>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>
>>>>> Don't trust relative measurements you get for small data on a single
>>>>> computer in local mode. Things change when you start running on
>>>>> hundreds of gigs with real skew on a cluster.
>>>>>
>>>>> D
>>>>>
>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>>>
>>>>> > Thanks Dmitriy,
>>>>> >
>>>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>>>> > The first method is giving me the following error:
>>>>> >
>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>> > does not exist in schema:
>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>> >
>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>> >
>>>>> > grunt> by_hour_client =
>>>>> > >>   foreach
>>>>> > >>     (group logs by (hour, client))
>>>>> > >>   generate
>>>>> > >>     flatten(group) as (hour, client),
>>>>> > >>     COUNT(logs) as num_reqs;
>>>>> > grunt> by_hour =
>>>>> > >>   foreach
>>>>> > >>     (group by_hour_client by hour)
>>>>> > >>   generate
>>>>> > >>     group as hour,
>>>>> > >>     COUNT(by_hour_client) as num_dist_clients,
>>>>> > >>     SUM(num_reqs) as total_requests;
>>>>> >
>>>>> > If I understood correctly that's because the num_reqs is in the bag, as a
>>>>> > result of the
>>>>> >   *(group by_hour_client by hour)*
>>>>> > correct? So I changed the last line to
>>>>> >   *SUM(by_hour_client.num_reqs) as total_requests;*
>>>>> > and it worked (it took a little more than 29 seconds).
>>>>> >
>>>>> > Thanks for your help,
>>>>> > David
>>>>> >
>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>> >
>>>>> > > by_hour_client =
>>>>> > >   foreach
>>>>> > >     (group logs by (hour, client) parallel $p)
>>>>> > >   generate
>>>>> > >     flatten(group) as (hour, client),
>>>>> > >     COUNT(logs) as num_reqs;
>>>>> > >
>>>>> > > by_hour =
>>>>> > >   foreach
>>>>> > >     (group by_hour_client by hour parallel $p2)
>>>>> > >   generate
>>>>> > >     group as hour,
>>>>> > >     COUNT(by_hour_client) as num_dist_clients,
>>>>> > >     SUM(num_reqs) as total_requests;
>>>>> > >
>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>> > > they do push part of this up to the mappers):
>>>>> > >
>>>>> > > by_hour =
>>>>> > >   foreach (group logs by hour) {
>>>>> > >     dist_clients = distinct logs.client;
>>>>> > >     generate
>>>>> > >       group as hour,
>>>>> > >       COUNT(dist_clients) as num_dist_clients,
>>>>> > >       COUNT(logs) as total_requests;
>>>>> > >   }
>>>>> > >
>>>>> > > D
>>>>> > >
>>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>>> > >
>>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>>>> > > > requests and of visits by hour.
>>>>> > > >
>>>>> > > > I managed to get the requests, but how do I get the visits?
>>>>> > > >
>>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>>> > > >   FLATTEN(
>>>>> > > >     REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>>> > > >   ) AS (
>>>>> > > >     client: chararray,
>>>>> > > >     username: chararray,
>>>>> > > >     date: chararray,
>>>>> > > >     hour: chararray,
>>>>> > > >     minute: chararray,
>>>>> > > >     second: chararray,
>>>>> > > >     timeZone: chararray,
>>>>> > > >     request: chararray,
>>>>> > > >     statusCode: int,
>>>>> > > >     bytesSent: chararray,
>>>>> > > >     referer: chararray,
>>>>> > > >     userAgent: chararray,
>>>>> > > >     remoteUser: chararray,
>>>>> > > >     timeTaken: chararray
>>>>> > > >   );
>>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>>> > > > DESCRIBE A;
>>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>>> > > >
>>>>> > > > How can I now get the distinct count of clients per hour?
>>>>> > > >
>>>>> > > > Thanks for your help!
>>>>> > > >
>>>>> > > > --
>>>>> > > > David Riccitelli
>>>>> > > >
>>>>> > > > ********************************************************************************
>>>>> > > > InsideOut10 s.r.l.
>>>>> > > > P.IVA: IT-11381771002
>>>>> > > > Fax: +39 0110708239
>>>>> > > > ---
>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> > > > Twitter: ziodave
>>>>> > > > ---
>>>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>> > > > ********************************************************************************
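
A possible reading of the ClassCastException discussed in this thread, offered as an assumption rather than a confirmed diagnosis: REGEX_EXTRACT_ALL returns every capture as a string, and declaring timeTaken as long in the AS clause may change only the declared schema, not the values, so MAX ends up receiving strings. The sketch below keeps every capture as chararray and converts with an explicit cast. It also drops lines the regex did not match, since those would carry null fields and might account for the (,0,0) row. The aliases parsed, logs, by_client and max_time are illustrative names that do not appear in the thread, and '<log-file>' is the same placeholder used in the original message.

```pig
-- Load each raw line as a single chararray, as in the original message.
RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);

-- Declare every capture as chararray: the regex UDF produces strings,
-- so no numeric type is claimed here (assumption: this avoids the
-- String-to-Long mismatch reported by LongMax).
LOGS_BASE = FOREACH RAW_LOGS GENERATE
  FLATTEN(
    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
  ) AS (
    client: chararray,
    username: chararray,
    date: chararray,
    hour: chararray,
    minute: chararray,
    second: chararray,
    timeZone: chararray,
    request: chararray,
    statusCode: chararray,
    bytesSent: chararray,
    referer: chararray,
    userAgent: chararray,
    remoteUser: chararray,
    timeTaken: chararray
  );

-- Lines that the regex did not match yield null fields; drop them.
parsed = FILTER LOGS_BASE BY client IS NOT NULL;

-- Convert with an explicit cast instead of relying on the AS clause.
logs = FOREACH parsed GENERATE client, hour, (long)timeTaken AS timeTaken;

-- Maximum time taken per client.
by_client = GROUP logs BY client;
max_time = FOREACH by_client GENERATE group AS client, MAX(logs.timeTaken) AS max_time_taken;

DUMP max_time;
```

If a timeTaken value is not numeric, the cast should produce a null for that record rather than fail the job, and MAX skips nulls, so a few malformed lines would not be expected to break the aggregate.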
