Hello,
Noob here. I am trying to analyze some Nginx log files and get some
aggregate stats based on date and URL. Here is the beginning of a Pig
script I have (I am running this in Elastic MapReduce, with Pig
0.9.1):
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
RAW_LOGS = LOAD '$INPUT' as (line:chararray);
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
    )
    AS (
        ip: chararray,
        day: chararray,
        month: chararray,
        year: chararray,
        hour: chararray,
        minute: chararray,
        second: chararray,
        tzoffset: chararray,
        url: chararray,
        status: chararray,
        bytes: chararray,
        referrer: chararray,
        useragent: chararray,
        xfwd: chararray,
        reqtime: float
    );
DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
Now I would like to get the MAX of the request time. If I do this:
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
STORE AGGREGATES INTO '$OUTPUT';
I get this error:
2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 91> Could not infer the matching function for org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of them fit. Please use an explicit cast.
If I add an explicit cast, like this:
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as maxreqtime;
I get another error:
2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052:
<file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
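One untested possibility, assuming the DEFINE MAX line is what triggers both errors: as far as I can tell, piggybank's math.MAX compares two scalar values (it wraps java.lang.Math.max), while Pig's built-in MAX is the aggregate that takes a bag such as FILTERED_DATE_URL.reqtime. Below is a sketch of the same aggregation, under the assumption that the DEFINE MAX line has been removed from the top of the script so that MAX resolves to the built-in:

```pig
-- Assumption: the line
--   DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
-- has been removed, so MAX below is Pig's built-in aggregate.
-- This fragment relies on GROUP_BY_DATE_URL and FILTERED_DATE_URL
-- as defined earlier in the script.
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    -- FILTERED_DATE_URL.reqtime is a bag of floats, one per grouped row
    MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;
STORE AGGREGATES INTO '$OUTPUT';
```

No cast should be needed in this form, since reqtime is already declared as float in the LOGS_BASE schema.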
Any help would be greatly appreciated.
Grig