I still can't manage to accomplish my objectives. I'm now trying to get the
max time taken so, as a test, I do:

grunt> A = GROUP logs BY client;

then (timeTaken is long):

B = FOREACH A GENERATE group, MAX( logs.timeTaken );

When I dump it, I get the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
(...)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
        at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)

Initially I thought that some timeTaken values were not compatible with the
long data type, but I checked and re-checked. I also extract timeTaken with
a \d+ regular expression.

What am I doing wrong?

Thanks!
David

On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]> wrote:

> I tried with another log file and that does not happen, so I suppose
> there's some 'corrupted' line in the one I was testing.
>
>
> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:
>
>> There's something strange in the results however:
>> (00,129,30096)
>> (01,91,16487)
>> (02,57,11686)
>> (03,41,6041)
>> (04,30,4882)
>> (05,33,4154)
>> (06,65,8031)
>> (07,66,12260)
>> (08,95,17924)
>> (09,131,21187)
>> (10,162,26607)
>> (11,155,28503)
>> (12,146,27863)
>> (13,152,29130)
>> (14,159,32784)
>> (15,150,28898)
>> (16,143,28973)
>> (17,169,29024)
>> (18,199,26585)
>> (19,182,28803)
>> (20,224,32511)
>> (21,232,38584)
>> (22,225,39924)
>> (23,191,33606)
>> (,0,0)
>>
>>
>> What is the last line:
>> (,0,0)
>> the count is zero, it shouldn't really be there, correct?
>>
>> (Using pig 0.9.0)
>>
>>
>> Thanks,
>> David
>>
>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> Right, that should read "by_hour_client.num_reqs".
>>>
>>> Don't trust relative measurements you get for small data on a single
>>> computer in local mode. Things change when you start running on hundreds
>>> of gigs with real skew on a cluster.
>>>
>>> D
>>>
>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>
>>> > Thanks Dmitriy,
>>> >
>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>> > The first method is giving me the following error:
>>> >
>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>> > does not exist in schema:
>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>> >
>>> > when I try to set the by_hour (after having set the by_hour_client):
>>> >
>>> > grunt> by_hour_client =
>>> > >>   foreach
>>> > >>     (group logs by (hour, client))
>>> > >>   generate
>>> > >>     flatten(group) as (hour, client),
>>> > >>     COUNT(logs) as num_reqs;
>>> > grunt> by_hour =
>>> > >>   foreach
>>> > >>     (group by_hour_client by hour)
>>> > >>   generate
>>> > >>     group as hour,
>>> > >>     COUNT(by_hour_client) as num_dist_clients,
>>> > >>     SUM(num_reqs) as total_requests;
>>> >
>>> > If I understood correctly that's because the num_reqs is in the bag, as a
>>> > result of the
>>> > * (group by_hour_client by hour)*
>>> > correct? So I changed the last line to
>>> > *SUM(by_hour_client.num_reqs) as total_requests;*
>>> > and it worked (it took a little more than 29 seconds).
>>> >
>>> > Thanks for your help,
>>> > David
>>> >
>>> >
>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>> >
>>> > > by_hour_client =
>>> > >   foreach
>>> > >     (group logs by (hour, client) parallel $p)
>>> > >   generate
>>> > >     flatten(group) as (hour, client),
>>> > >     COUNT(logs) as num_reqs;
>>> > >
>>> > > by_hour =
>>> > >   foreach
>>> > >     (group by_hour_client by hour parallel $p2)
>>> > >   generate
>>> > >     group as hour,
>>> > >     COUNT(by_hour_client) as num_dist_clients,
>>> > >     SUM(num_reqs) as total_requests;
>>> > >
>>> > > You can also do this using a nested distinct, but depending on what your
>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>> > > individual reducers that have to do the inner distinct in memory (although
>>> > > they do push part of this up to the mappers):
>>> > >
>>> > > by_hour =
>>> > >   foreach (group logs by hour) {
>>> > >     dist_clients = distinct logs.client;
>>> > >     generate
>>> > >       group as hour,
>>> > >       COUNT(dist_clients) as num_dist_clients,
>>> > >       COUNT(logs) as total_requests;
>>> > >   }
>>> > >
>>> > > D
>>> > >
>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>> > >
>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>> > > > requests and of visits by hour.
>>> > > >
>>> > > > I managed to get the requests, but how do I get the visits?
>>> > > >
>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>> > > >   FLATTEN(
>>> > > >     REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>> > > >   ) AS (
>>> > > >     client: chararray,
>>> > > >     username: chararray,
>>> > > >     date: chararray,
>>> > > >     hour: chararray,
>>> > > >     minute: chararray,
>>> > > >     second: chararray,
>>> > > >     timeZone: chararray,
>>> > > >     request: chararray,
>>> > > >     statusCode: int,
>>> > > >     bytesSent: chararray,
>>> > > >     referer: chararray,
>>> > > >     userAgent: chararray,
>>> > > >     remoteUser: chararray,
>>> > > >     timeTaken: chararray
>>> > > >   );
>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>> > > > DESCRIBE A;
>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,
>>> > > >   date: chararray,hour: chararray,minute: chararray,second: chararray,
>>> > > >   timeZone: chararray,request: chararray,statusCode: int,
>>> > > >   bytesSent: chararray,referer: chararray,userAgent: chararray,
>>> > > >   remoteUser: chararray,timeTaken: chararray)}}
>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>> > > >
>>> > > > How can I now get the distinct count of clients per hour?
>>> > > >
>>> > > > Thanks for your help!
>>> > > >
>>> > > > --
>>> > > > David Riccitelli
>>> > > >
>>> > > > ********************************************************************************
>>> > > > InsideOut10 s.r.l.
>>> > > > P.IVA: IT-11381771002
>>> > > > Fax: +39 0110708239
>>> > > > ---
>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>> > > > Twitter: ziodave
>>> > > > ---
>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>> > > > ********************************************************************************
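
On the ClassCastException reported in the newest message of this thread: the stack trace says the values reaching MAX are still Java Strings, so one thing that may be worth trying is to leave timeTaken as chararray in the AS clause (as in the original LOGS_BASE definition) and convert it with an explicit cast before grouping. This is only a sketch under that assumption, not a confirmed diagnosis. The relation names matched, typed, by_client and max_by_client are made up for illustration. The filter is there on the further assumption that lines which do not match the regex come through with null fields, which might also be where the (,0,0) row comes from.

```pig
-- Assumes LOGS_BASE is defined as in the original script, with
-- timeTaken declared as chararray (not long) in the AS clause.

-- Drop records with null fields (assumed to be lines the regex did not match).
matched = FILTER LOGS_BASE BY client IS NOT NULL AND timeTaken IS NOT NULL;

-- Convert the chararray field to long with an explicit cast.
typed = FOREACH matched GENERATE client, (long) timeTaken AS timeTaken;

by_client = GROUP typed BY client;

-- MAX is now applied to a bag whose timeTaken field is declared and cast as long.
max_by_client = FOREACH by_client GENERATE
    group AS client,
    MAX(typed.timeTaken) AS max_time_taken;

DUMP max_by_client;
```

If the error persists with the cast in place, running DESCRIBE typed should at least show whether timeTaken is a long at that point in the script.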
