Try the built-in MAX instead: remove the DEFINE MAX line so the piggybank
version is not picked up. If that doesn't work, try the built-in FloatMax
(again, not the piggybank one).
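
A minimal sketch of that change, assuming the rest of your script stays
exactly as quoted below. As far as I know, the piggybank math.MAX wraps
Java's Math.max and expects two scalar arguments, while the built-in MAX
is an aggregate that takes a bag, which would explain both errors (the
cast fails because FILTERED_DATE_URL.reqtime is a bag, not a single float).

```pig
-- Same script as quoted below, minus the piggybank MAX definition.
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
-- No "DEFINE MAX ..." line here, so MAX resolves to Pig's built-in aggregate.

-- RAW_LOGS through GROUP_BY_DATE_URL are unchanged from the original script.

AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;

-- Fallback if the built-in MAX still cannot be resolved: name the float variant.
-- AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
--     FLATTEN(group) AS (year, month, day, url),
--     FloatMax(FILTERED_DATE_URL.reqtime) AS maxreqtime;

STORE AGGREGATES INTO '$OUTPUT';
```

No explicit cast should be needed, since reqtime is already declared as
float in the AS clause.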


On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu
<[email protected]> wrote:
> Hello,
>
> Noob here. I am trying to analyze some Nginx log files and get some
> aggregate stats based on date and URL. Here is the beginning of a Pig
> script I have (I am running this in Elastic MapReduce, with Pig
> 0.9.1):
>
>
> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>
> FLATTEN(
>        EXTRACT(line, '(\\S+) - -
> \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+)
> ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
> )
> AS (
>        ip: chararray,
>        day: chararray,
>        month: chararray,
>        year: chararray,
>        hour: chararray,
>        minute: chararray,
>        second: chararray,
>        tzoffset: chararray,
>        url: chararray,
>        status: chararray,
>        bytes: chararray,
>        referrer: chararray,
>        useragent: chararray,
>        xfwd: chararray,
>        reqtime: float
> );
> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
>
> Now I would like to get the MAX of the request time. If I do this:
>
> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
> (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
> STORE AGGREGATES INTO '$OUTPUT';
>
> I get this error
>
> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 1045:
> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
> 91> Could not infer the matching function for
> org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of
> them fit. Please use an explicit cast.
>
> If I do this cast:
>
> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
> (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as
> maxreqtime;
>
> I get another error:
>
> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 1052:
> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
> 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
>
> Any help would be greatly appreciated.
>
> Grig
