Sorry for this long sequence of messages, but I'm posting things as I continue testing/investigating.
Maybe this is relevant to my case?
http://www.mail-archive.com/[email protected]/msg02258.html

Thanks,
David

On Fri, Aug 19, 2011 at 7:31 PM, David Riccitelli <[email protected]> wrote:

> I tried changing this line, from:
> RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
> USING TextLoader() AS (line:chararray);
>
> to:
> RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
> USING PigStorage() AS (line:chararray);
>
> It does not fix the issue, as it depends on the REGEX_EXTRACT_ALL
> that produces the logs schema.
>
> Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
> function?
>
> Thanks for your help,
> David
>
> On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <[email protected]> wrote:
>
>> I noticed that this issue arises only if I load the initial data with the
>> TextLoader() and use the REGEX_EXTRACT_ALL.
>>
>> If I use the PigStorage (splitting on spaces, not using RegExp, i.e. w/o
>> REGEX_EXTRACT_ALL), it works.
>>
>> But I need the REGEX_EXTRACT_ALL in order to correctly parse the lines...
>>
>> Does it make sense?
>>
>> David
>>
>> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]> wrote:
>>
>>> I still can't manage to accomplish my objectives. I'm now trying to get
>>> the max time taken so, as a test, I do:
>>> grunt> A = GROUP logs BY client;
>>>
>>> then (timeTaken is long):
>>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>>
>>> when I dump it, I get the following error:
>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error
>>> while computing max in Initial
>>> (...)
>>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast
>>> to java.lang.Long
>>> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>>
>>> Initially I thought that I had some timeTaken not compatible with the long
>>> data type, but I checked and re-checked. I also match the timeTaken with a
>>> \d+ regular expression.
>>>
>>> What am I doing wrong?
>>>
>>> Thanks!
>>> David
>>>
>>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]> wrote:
>>>
>>>> I tried with another log file and that does not happen, so I suppose
>>>> there's some 'corrupted' line in the one I was testing.
>>>>
>>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:
>>>>
>>>>> There's something strange in the results however:
>>>>> (00,129,30096)
>>>>> (01,91,16487)
>>>>> (02,57,11686)
>>>>> (03,41,6041)
>>>>> (04,30,4882)
>>>>> (05,33,4154)
>>>>> (06,65,8031)
>>>>> (07,66,12260)
>>>>> (08,95,17924)
>>>>> (09,131,21187)
>>>>> (10,162,26607)
>>>>> (11,155,28503)
>>>>> (12,146,27863)
>>>>> (13,152,29130)
>>>>> (14,159,32784)
>>>>> (15,150,28898)
>>>>> (16,143,28973)
>>>>> (17,169,29024)
>>>>> (18,199,26585)
>>>>> (19,182,28803)
>>>>> (20,224,32511)
>>>>> (21,232,38584)
>>>>> (22,225,39924)
>>>>> (23,191,33606)
>>>>> (,0,0)
>>>>>
>>>>> What is the last line:
>>>>> (,0,0)
>>>>> the count is zero, it shouldn't really be there, correct?
>>>>>
>>>>> (Using pig 0.9.0)
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>>
>>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>>
>>>>>> Don't trust relative measurements you get for small data on a single
>>>>>> computer in local mode. Things change when you start running on
>>>>>> hundreds of gigs with real skew on a cluster.
>>>>>>
>>>>>> D
>>>>>>
>>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>>>>
>>>>>> > Thanks Dmitriy,
>>>>>> >
>>>>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>>>>> > The first method is giving me the following error:
>>>>>> >
>>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>>> > does not exist in schema:
>>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>>> >
>>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>>> >
>>>>>> > grunt> by_hour_client =
>>>>>> > >> foreach
>>>>>> > >> (group logs by (hour, client))
>>>>>> > >> generate
>>>>>> > >> flatten(group) as (hour, client),
>>>>>> > >> COUNT(logs) as num_reqs;
>>>>>> > grunt> by_hour =
>>>>>> > >> foreach
>>>>>> > >> (group by_hour_client by hour)
>>>>>> > >> generate
>>>>>> > >> group as hour,
>>>>>> > >> COUNT(by_hour_client) as num_dist_clients,
>>>>>> > >> SUM(num_reqs) as total_requests;
>>>>>> >
>>>>>> > If I understood correctly that's because the num_reqs is in the bag, as a
>>>>>> > result of the
>>>>>> > *(group by_hour_client by hour)*
>>>>>> > correct? So I changed the last line to
>>>>>> > *SUM(by_hour_client.num_reqs) as total_requests;*
>>>>>> > and it worked (it took a little more than 29 seconds).
>>>>>> >
>>>>>> > Thanks for your help,
>>>>>> > David
>>>>>> >
>>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>>> >
>>>>>> > > by_hour_client =
>>>>>> > >   foreach
>>>>>> > >     (group logs by (hour, client) parallel $p)
>>>>>> > >   generate
>>>>>> > >     flatten(group) as (hour, client),
>>>>>> > >     COUNT(logs) as num_reqs;
>>>>>> > >
>>>>>> > > by_hour =
>>>>>> > >   foreach
>>>>>> > >     (group by_hour_client by hour parallel $p2)
>>>>>> > >   generate
>>>>>> > >     group as hour,
>>>>>> > >     COUNT(by_hour_client) as num_dist_clients,
>>>>>> > >     SUM(num_reqs) as total_requests;
>>>>>> > >
>>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>>> > > they do push part of this up to the mappers):
>>>>>> > >
>>>>>> > > by_hour =
>>>>>> > >   foreach (group logs by hour) {
>>>>>> > >     dist_clients = distinct logs.client;
>>>>>> > >     generate
>>>>>> > >       group as hour,
>>>>>> > >       COUNT(dist_clients) as num_dist_clients,
>>>>>> > >       COUNT(logs) as total_requests;
>>>>>> > >   }
>>>>>> > >
>>>>>> > > D
>>>>>> > >
>>>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>>>> > >
>>>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>>>>> > > > requests and of visits by hour.
>>>>>> > > >
>>>>>> > > > I managed to get the requests, but how do I get the visits?
>>>>>> > > >
>>>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>>>> > > > FLATTEN(
>>>>>> > > > REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+)
>>>>>> > > > \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\]
>>>>>> > > > "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>>>> > > > ) AS (
>>>>>> > > > client: chararray,
>>>>>> > > > username: chararray,
>>>>>> > > > date: chararray,
>>>>>> > > > hour: chararray,
>>>>>> > > > minute: chararray,
>>>>>> > > > second: chararray,
>>>>>> > > > timeZone: chararray,
>>>>>> > > > request: chararray,
>>>>>> > > > statusCode: int,
>>>>>> > > > bytesSent: chararray,
>>>>>> > > > referer: chararray,
>>>>>> > > > userAgent: chararray,
>>>>>> > > > remoteUser: chararray,
>>>>>> > > > timeTaken: chararray
>>>>>> > > > );
>>>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>>>> > > > DESCRIBE A;
>>>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username:
>>>>>> > > > chararray,date: chararray,hour: chararray,minute: chararray,second:
>>>>>> > > > chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent:
>>>>>> > > > chararray,referer: chararray,userAgent: chararray,remoteUser:
>>>>>> > > > chararray,timeTaken: chararray)}}
>>>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>>>> > > >
>>>>>> > > > How can I now get the distinct count of clients per hour?
>>>>>> > > >
>>>>>> > > > Thanks for your help!
>>>>>> > > >
>>>>>> > > > --
>>>>>> > > > David Riccitelli
>>>>>> > > >
>>>>>> > > > ********************************************************************************
>>>>>> > > > InsideOut10 s.r.l.
>>>>>> > > > P.IVA: IT-11381771002
>>>>>> > > > Fax: +39 0110708239
>>>>>> > > > ---
>>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>> > > > Twitter: ziodave
>>>>>> > > > ---
>>>>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>> > > > ********************************************************************************
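
A possible workaround for the ClassCastException quoted above, offered as a sketch rather than a confirmed fix. The error suggests that the values coming out of REGEX_EXTRACT_ALL are still strings at runtime, even when the AS clause declares timeTaken as long. Assuming that is the cause, one option is to keep every extracted field as chararray and cast explicitly in a separate FOREACH. LOGS_BASE is the relation from the original script (where timeTaken is declared chararray); the names LOGS_TYPED, BY_CLIENT and MAX_TIME are hypothetical.

```pig
-- Assumption: LOGS_BASE declares timeTaken as chararray, as in the original script.
-- Cast explicitly so that MAX receives long values instead of strings.
LOGS_TYPED = FOREACH LOGS_BASE GENERATE
    client,
    hour,
    (long) timeTaken AS timeTaken;

-- Same grouping and aggregation as in the failing test, on the typed relation.
BY_CLIENT = GROUP LOGS_TYPED BY client;
MAX_TIME = FOREACH BY_CLIENT GENERATE
    group AS client,
    MAX(LOGS_TYPED.timeTaken) AS maxTimeTaken;

DUMP MAX_TIME;
```

With an explicit cast, a timeTaken value that cannot be parsed as a long should come through as null rather than as a string, so it is worth checking for null rows if the result looks incomplete.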
