I tried with another log file and that does not happen, so I suppose there's some 'corrupted' line in the one I was testing.
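
If that is the cause, the empty group is probably made of lines the regex could not parse: REGEX_EXTRACT_ALL returns null when a line does not match, so every extracted field (including hour) is null, GROUP collects the null keys into a single group, and COUNT skips tuples whose first field is null. A minimal sketch of a check and a workaround, assuming LOGS_BASE is the relation from my first message (the names logs, unparsed and unparsed_count are made up for the example):

```pig
-- Assumption: LOGS_BASE is defined as in the original message.
-- Count the lines that failed to parse; COUNT_STAR is used because
-- COUNT would skip these all-null tuples.
unparsed = FILTER LOGS_BASE BY hour IS NULL;
unparsed_count = FOREACH (GROUP unparsed ALL) GENERATE COUNT_STAR(unparsed);
DUMP unparsed_count;

-- Drop the unparsed lines before grouping, so no null-hour group is produced.
logs = FILTER LOGS_BASE BY hour IS NOT NULL;

by_hour_client =
  FOREACH (GROUP logs BY (hour, client))
  GENERATE
    FLATTEN(group) AS (hour, client),
    COUNT(logs) AS num_reqs;

by_hour =
  FOREACH (GROUP by_hour_client BY hour)
  GENERATE
    group AS hour,
    COUNT(by_hour_client) AS num_dist_clients,
    SUM(by_hour_client.num_reqs) AS total_requests;
```

I have not run this against the problematic file, so treat it as a starting point rather than a confirmed fix.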

On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:

> There's something strange in the results however:
> (00,129,30096)
> (01,91,16487)
> (02,57,11686)
> (03,41,6041)
> (04,30,4882)
> (05,33,4154)
> (06,65,8031)
> (07,66,12260)
> (08,95,17924)
> (09,131,21187)
> (10,162,26607)
> (11,155,28503)
> (12,146,27863)
> (13,152,29130)
> (14,159,32784)
> (15,150,28898)
> (16,143,28973)
> (17,169,29024)
> (18,199,26585)
> (19,182,28803)
> (20,224,32511)
> (21,232,38584)
> (22,225,39924)
> (23,191,33606)
> (,0,0)
>
> What is the last line:
> (,0,0)
> the count is zero, it shouldn't really be there, correct?
>
> (Using pig 0.9.0)
>
> Thanks,
> David
>
> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Right, that should read "by_hour_client.num_reqs".
>>
>> Don't trust relative measurements you get for small data on a single
>> computer in local mode. Things change when you start running on hundreds of
>> gigs with real skew on a cluster.
>>
>> D
>>
>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>
>> > Thanks Dmitriy,
>> >
>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>> > The first method is giving me the following error:
>> >
>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>> > does not exist in schema:
>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>> >
>> > when I try to set the by_hour (after having set the by_hour_client):
>> >
>> > grunt> by_hour_client =
>> > >>   foreach
>> > >>     (group logs by (hour, client))
>> > >>   generate
>> > >>     flatten(group) as (hour, client),
>> > >>     COUNT(logs) as num_reqs;
>> > grunt> by_hour =
>> > >>   foreach
>> > >>     (group by_hour_client by hour)
>> > >>   generate
>> > >>     group as hour,
>> > >>     COUNT(by_hour_client) as num_dist_clients,
>> > >>     SUM(num_reqs) as total_requests;
>> >
>> > If I understood correctly that's because the num_reqs is in the bag, as a
>> > result of the
>> > *(group by_hour_client by hour)*
>> > correct? So I changed the last line to
>> > *SUM(by_hour_client.num_reqs) as total_requests;*
>> > and it worked (it took a little more than 29 seconds).
>> >
>> > Thanks for your help,
>> > David
>> >
>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> >
>> > > by_hour_client =
>> > >   foreach
>> > >     (group logs by (hour, client) parallel $p)
>> > >   generate
>> > >     flatten(group) as (hour, client),
>> > >     COUNT(logs) as num_reqs;
>> > >
>> > > by_hour =
>> > >   foreach
>> > >     (group by_hour_client by hour parallel $p2)
>> > >   generate
>> > >     group as hour,
>> > >     COUNT(by_hour_client) as num_dist_clients,
>> > >     SUM(num_reqs) as total_requests;
>> > >
>> > > You can also do this using a nested distinct, but depending on what your
>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>> > > individual reducers that have to do the inner distinct in memory (although
>> > > they do push part of this up to the mappers):
>> > >
>> > > by_hour =
>> > >   foreach (group logs by hour) {
>> > >     dist_clients = distinct logs.client;
>> > >     generate
>> > >       group as hour,
>> > >       COUNT(dist_clients) as num_dist_clients,
>> > >       COUNT(logs) as total_requests;
>> > >   }
>> > >
>> > > D
>> > >
>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>> > >
>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>> > > > requests and of visits by hour.
>> > > >
>> > > > I managed to get the requests, but how do I get the visits?
>> > > >
>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>> > > >   FLATTEN(
>> > > >     REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>> > > >   ) AS (
>> > > >     client: chararray,
>> > > >     username: chararray,
>> > > >     date: chararray,
>> > > >     hour: chararray,
>> > > >     minute: chararray,
>> > > >     second: chararray,
>> > > >     timeZone: chararray,
>> > > >     request: chararray,
>> > > >     statusCode: int,
>> > > >     bytesSent: chararray,
>> > > >     referer: chararray,
>> > > >     userAgent: chararray,
>> > > >     remoteUser: chararray,
>> > > >     timeTaken: chararray
>> > > >   );
>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>> > > > DESCRIBE A;
>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>> > > >
>> > > > How can I now get the distinct count of clients per hour?
>> > > >
>> > > > Thanks for your help!
>> > > >
>> > > > --
>> > > > David Riccitelli
>> > > >
>> > > > ********************************************************************************
>> > > > InsideOut10 s.r.l.

--
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
