Try just using the built-in MAX (no piggybank): drop the DEFINE MAX line so the name no longer points at the piggybank function. If that doesn't work, try FloatMax (still no piggybank).
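Roughly what I mean, as an untested sketch against the script quoted below. As far as I know the piggybank math.MAX compares two scalar values rather than aggregating a bag, which would explain the "could not infer the matching function" error; treat that as my assumption. The relation and field names are the ones from your script.

```pig
-- Delete (or comment out) this line so MAX resolves to Pig's built-in aggregate:
-- DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();

-- Built-in MAX takes a bag with a single numeric column and returns its largest value.
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;

-- Fallback if the built-in MAX still cannot be resolved:
-- name the float-specific built-in directly.
-- AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
--     FLATTEN(group) AS (year, month, day, url),
--     FloatMax(FILTERED_DATE_URL.reqtime) AS maxreqtime;

STORE AGGREGATES INTO '$OUTPUT';
```

No cast is needed in either form: reqtime is already declared as float in the AS clause, and the aggregate expects the bag itself, not a scalar.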
On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu <[email protected]> wrote:
> Hello,
>
> Noob here. I am trying to analyze some Nginx log files and get some
> aggregate stats based on date and URL. Here is the beginning of a Pig
> script I have (I am running this in Elastic MapReduce, with Pig
> 0.9.1):
>
> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>   FLATTEN(
>     EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
>   )
>   AS (
>     ip: chararray,
>     day: chararray,
>     month: chararray,
>     year: chararray,
>     hour: chararray,
>     minute: chararray,
>     second: chararray,
>     tzoffset: chararray,
>     url: chararray,
>     status: chararray,
>     bytes: chararray,
>     referrer: chararray,
>     useragent: chararray,
>     xfwd: chararray,
>     reqtime: float
>   );
> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
>
> Now I would like to get the MAX of the request time. If I do this:
>
> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
> (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
> STORE AGGREGATES INTO '$OUTPUT';
>
> I get this error
>
> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 1045:
> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
> 91> Could not infer the matching function for
> org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of
> them fit. Please use an explicit cast.
>
> If I do this cast:
>
> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
> (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as
> maxreqtime;
>
> I get another error:
>
> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 1052:
> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
> 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
>
> Any help would be greatly appreciated.
>
> Grig
