Sounds like the regex returns matched strings, so even though you declare that the time should be a float, at runtime it's actually a String.

Try foo = foreach LOGS_BASE generate field1, field2, (float) reqtime; then group / max on foo.
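In your script that would look something like the sketch below. Untested, and two caveats: I'd also declare reqtime as a chararray in the AS clause (EXTRACT only ever returns strings, and the float declaration there is what sets up the ClassCastException downstream), and the LOGS_CAST / FILTERED / GROUPED aliases are just placeholders:

LOGS_CAST = FOREACH LOGS_BASE GENERATE
    year, month, day, url,
    (float) reqtime AS reqtime;           -- explicit cast; a real float from here on
FILTERED = FILTER LOGS_CAST BY url IS NOT NULL;
GROUPED = GROUP FILTERED BY (year, month, day, url);
AGGREGATES = FOREACH GROUPED GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED.reqtime) AS maxreqtime;  -- built-in MAX, no piggybank DEFINE needed
STORE AGGREGATES INTO '$OUTPUT';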
D

On Tue, Dec 20, 2011 at 3:58 PM, Grig Gheorghiu <[email protected]> wrote:
> Here it is
>
> Backend error message
> ---------------------
> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>         at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:84)
>         at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:64)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
>         at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:76)
>         ... 19 more
>
> Pig Stack Trace
> ---------------
> ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:343)
>         at org.apache.pig.PigServer.launchPlan(PigServer.java:1314)
>         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1299)
>         at org.apache.pig.PigServer.execute(PigServer.java:1286)
>
> On Tue, Dec 20, 2011 at 2:44 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > Can you take a look at the failed map tasks and paste their stack traces?
> >
> > On Tue, Dec 20, 2011 at 2:40 PM, Grig Gheorghiu <[email protected]> wrote:
> >> Thanks Dmitriy! I made some progress by using MAX without the piggybank declaration, but I still got an error:
> >>
> >> 2011-12-20 22:35:25,832 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pigscripts/pig_1324420525827.log
> >> 2011-12-20 22:35:26,139 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.110.209.25:9000
> >> 2011-12-20 22:35:26,615 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.110.209.25:9001
> >> 2011-12-20 22:35:27,815 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
> >> 2011-12-20 22:35:29,109 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> >> 2011-12-20 22:35:29,121 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
> >> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> >> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> >> 2011-12-20 22:35:29,293 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> >> 2011-12-20 22:35:29,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> >> 2011-12-20 22:35:37,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
> >> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
> >> 2011-12-20 22:35:37,328 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> >> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
> >> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 32
> >> 2011-12-20 22:35:37,346 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of reduce tasks: 1
> >> 2011-12-20 22:35:37,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> >> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> >> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> >> 2011-12-20 22:35:38,093 [Thread-5] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
> >> 2011-12-20 22:35:38,095 [Thread-5] WARN com.hadoop.compression.lzo.LzoCodec - Could not find build properties file with revision hash
> >> 2011-12-20 22:35:38,095 [Thread-5] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
> >> 2011-12-20 22:35:38,102 [Thread-5] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
> >> 2011-12-20 22:35:38,102 [Thread-5] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
> >> 2011-12-20 22:35:38,105 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> >> 2011-12-20 22:35:38,921 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201112192006_0026
> >> 2011-12-20 22:35:38,922 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://10.110.209.25:9100/jobdetails.jsp?jobid=job_201112192006_0026
> >> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201112192006_0026 has failed! Stop running all dependent jobs
> >> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> >> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> >> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> >>
> >> HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
> >> 0.20.205       0.9.1-amzn  hadoop  2011-12-20 22:35:29  2011-12-20 22:36:28  GROUP_BY,FILTER
> >>
> >> Failed!
> >>
> >> Failed Jobs:
> >> JobId  Alias  Feature  Message  Outputs
> >> job_201112192006_0026  AGGREGATES,DATE_URL,FILTERED_DATE_URL,GROUP_BY_DATE_URL,LOGS_BASE,RAW_LOGS  GROUP_BY,COMBINER  Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201112192006_0026_m_000000  s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235,
> >>
> >> Input(s):
> >> Failed to read data from "s3://mapreduce.bucket/nginx/*test*.gz"
> >>
> >> Output(s):
> >> Failed to produce result in "s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235"
> >>
> >> Counters:
> >> Total records written : 0
> >> Total bytes written : 0
> >> Spillable Memory Manager spill count : 0
> >> Total bags proactively spilled: 0
> >> Total records proactively spilled: 0
> >>
> >> Job DAG:
> >> job_201112192006_0026
> >>
> >> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> >> 2011-12-20 22:36:29,181 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >> Details at logfile: /home/hadoop/pigscripts/pig_1324420525827.log
> >>
> >> On Tue, Dec 20, 2011 at 2:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >>> Try just using MAX (no piggybank). If that doesn't work, try FloatMax (still no piggybank).
> >>>
> >>> On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu <[email protected]> wrote:
> >>>> Hello,
> >>>>
> >>>> Noob here. I am trying to analyze some Nginx log files and get some aggregate stats based on date and URL.
> >>>> Here is the beginning of a Pig script I have (I am running this in Elastic MapReduce, with Pig 0.9.1):
> >>>>
> >>>> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
> >>>> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
> >>>> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
> >>>> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
> >>>> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> >>>> FLATTEN(
> >>>>   EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
> >>>> )
> >>>> AS (
> >>>>   ip: chararray,
> >>>>   day: chararray,
> >>>>   month: chararray,
> >>>>   year: chararray,
> >>>>   hour: chararray,
> >>>>   minute: chararray,
> >>>>   second: chararray,
> >>>>   tzoffset: chararray,
> >>>>   url: chararray,
> >>>>   status: chararray,
> >>>>   bytes: chararray,
> >>>>   referrer: chararray,
> >>>>   useragent: chararray,
> >>>>   xfwd: chararray,
> >>>>   reqtime: float
> >>>> );
> >>>> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
> >>>> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
> >>>> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
> >>>>
> >>>> Now I would like to get the MAX of the request time. If I do this:
> >>>>
> >>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
> >>>> STORE AGGREGATES INTO '$OUTPUT';
> >>>>
> >>>> I get this error:
> >>>>
> >>>> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
> >>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 91> Could not infer the matching function for org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of them fit. Please use an explicit cast.
> >>>>
> >>>> If I do this cast:
> >>>>
> >>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as maxreqtime;
> >>>>
> >>>> I get another error:
> >>>>
> >>>> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052:
> >>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
> >>>>
> >>>> Any help would be greatly appreciated.
> >>>>
> >>>> Grig
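A closing note on why the two attempts in the original message fail differently: the piggybank org.apache.pig.piggybank.evaluation.math.MAX wraps the two-argument java.lang.Math.max, so no signature fits when it is handed a single bag (ERROR 1045), while in MAX((float)FILTERED_DATE_URL.reqtime) the cast applies to the whole bag rather than to the field inside it, which Pig's type checker rejects (ERROR 1052). Schematically, with the relations from the script above:

-- fails: the (float) cast applies to the bag, not the field (ERROR 1052)
-- AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
--     FLATTEN(group), MAX((float)FILTERED_DATE_URL.reqtime);

-- works once reqtime is genuinely a float (see the sketch at the top of the thread):
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;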
