Hello,

Noob here. I am trying to analyze some Nginx log files and get some
aggregate stats based on date and URL. Here is the beginning of a Pig
script I have (I am running this in Elastic MapReduce, with Pig
0.9.1):


REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
RAW_LOGS = LOAD '$INPUT' as (line:chararray);
LOGS_BASE = FOREACH RAW_LOGS GENERATE

FLATTEN(
        EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
)
AS (
        ip: chararray,
        day: chararray,
        month: chararray,
        year: chararray,
        hour: chararray,
        minute: chararray,
        second: chararray,
        tzoffset: chararray,
        url: chararray,
        status: chararray,
        bytes: chararray,
        referrer: chararray,
        useragent: chararray,
        xfwd: chararray,
        reqtime: float
);
DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);

Now I would like to get the MAX of the request time. If I do this:

AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
(year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
STORE AGGREGATES INTO '$OUTPUT';

I get this error:

2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1045:
<file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
91> Could not infer the matching function for
org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of
them fit. Please use an explicit cast.

If I do this cast:

AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
(year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as
maxreqtime;

I get another error:

2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1052:
<file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
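
A possible explanation for both errors, offered as an untested sketch
rather than a confirmed fix: piggybank's math.MAX appears to be a scalar
function that compares two numbers, so it has no variant that accepts a
bag such as FILTERED_DATE_URL.reqtime, and a bag cannot be cast to
float. Pig's builtin MAX is the aggregate that takes a bag of numbers.
Because the DEFINE above rebinds the name MAX to the piggybank class,
the sketch below calls the builtin by its full class name; removing the
DEFINE MAX line and writing plain MAX should be equivalent.

```pig
-- Sketch only, assuming the relations defined earlier in the script.
-- org.apache.pig.builtin.MAX is Pig's builtin aggregate over a bag;
-- using the full class name sidesteps the piggybank DEFINE MAX alias.
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    org.apache.pig.builtin.MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;
STORE AGGREGATES INTO '$OUTPUT';
```

No cast is needed here, since reqtime is already declared as float in
the LOGS_BASE schema.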

Any help would be greatly appreciated.

Grig
