Can you take a look at the failed map tasks and paste their stack traces?
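
While you're pulling those up, one thing worth checking on the 2106: piggybank's EXTRACT returns its match groups as chararrays, and in Pig 0.9 the AS clause on a FLATTEN of a UDF result largely asserts a schema rather than forcing a conversion. If that's what is happening, the builtin MAX resolves to its float implementation from the declared type but then receives strings at runtime, and fails inside the combiner in exactly this way ("computing max in Initial"). A hedged, untested sketch of a workaround, reusing the aliases from your script: declare reqtime as chararray in LOGS_BASE, then cast it explicitly one step later so the conversion really happens:

-- reqtime: chararray in LOGS_BASE's AS clause, then:
-- the explicit cast converts the string for real; the type in AS is only a declaration
DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, (float)reqtime AS reqtime;

DESCRIBE only shows the declared schema, so it will look right either way; if this is the cause, the failed task logs should show the underlying ClassCastException.
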
On Tue, Dec 20, 2011 at 2:40 PM, Grig Gheorghiu <[email protected]> wrote:
> Thanks Dmitriy! I made some progress by using MAX without the piggybank declaration, but I still got an error:
>
> 2011-12-20 22:35:25,832 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pigscripts/pig_1324420525827.log
> 2011-12-20 22:35:26,139 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.110.209.25:9000
> 2011-12-20 22:35:26,615 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.110.209.25:9001
> 2011-12-20 22:35:27,815 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
> 2011-12-20 22:35:29,109 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2011-12-20 22:35:29,121 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2011-12-20 22:35:29,293 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2011-12-20 22:35:29,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2011-12-20 22:35:37,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
> 2011-12-20 22:35:37,328 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 32
> 2011-12-20 22:35:37,346 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of reduce tasks: 1
> 2011-12-20 22:35:37,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2011-12-20 22:35:38,093 [Thread-5] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
> 2011-12-20 22:35:38,095 [Thread-5] WARN com.hadoop.compression.lzo.LzoCodec - Could not find build properties file with revision hash
> 2011-12-20 22:35:38,095 [Thread-5] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
> 2011-12-20 22:35:38,102 [Thread-5] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
> 2011-12-20 22:35:38,102 [Thread-5] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
> 2011-12-20 22:35:38,105 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> 2011-12-20 22:35:38,921 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201112192006_0026
> 2011-12-20 22:35:38,922 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://10.110.209.25:9100/jobdetails.jsp?jobid=job_201112192006_0026
> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201112192006_0026 has failed! Stop running all dependent jobs
> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
> 0.20.205        0.9.1-amzn   hadoop   2011-12-20 22:35:29   2011-12-20 22:36:28   GROUP_BY,FILTER
>
> Failed!
>
> Failed Jobs:
> JobId   Alias   Feature   Message   Outputs
> job_201112192006_0026   AGGREGATES,DATE_URL,FILTERED_DATE_URL,GROUP_BY_DATE_URL,LOGS_BASE,RAW_LOGS   GROUP_BY,COMBINER   Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
> LastFailedTask: task_201112192006_0026_m_000000   s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235,
>
> Input(s):
> Failed to read data from "s3://mapreduce.bucket/nginx/*test*.gz"
>
> Output(s):
> Failed to produce result in "s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201112192006_0026
>
> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2011-12-20 22:36:29,181 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> Details at logfile: /home/hadoop/pigscripts/pig_1324420525827.log
>
> On Tue, Dec 20, 2011 at 2:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> Try just using MAX (no piggybank). If that doesn't work, try FloatMax (still no piggybank).
>>
>> On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu <[email protected]> wrote:
>>> Hello,
>>>
>>> Noob here. I am trying to analyze some Nginx log files and get some aggregate stats based on date and URL. Here is the beginning of a Pig script I have (I am running this in Elastic MapReduce, with Pig 0.9.1):
>>>
>>> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
>>> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
>>> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
>>> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
>>> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>     FLATTEN(
>>>         EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
>>>     )
>>>     AS (
>>>         ip: chararray,
>>>         day: chararray,
>>>         month: chararray,
>>>         year: chararray,
>>>         hour: chararray,
>>>         minute: chararray,
>>>         second: chararray,
>>>         tzoffset: chararray,
>>>         url: chararray,
>>>         status: chararray,
>>>         bytes: chararray,
>>>         referrer: chararray,
>>>         useragent: chararray,
>>>         xfwd: chararray,
>>>         reqtime: float
>>>     );
>>> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
>>> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
>>> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
>>>
>>> Now I would like to get the MAX of the request time. If I do this:
>>>
>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
>>> STORE AGGREGATES INTO '$OUTPUT';
>>>
>>> I get this error:
>>>
>>> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 91> Could not infer the matching function for org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of them fit. Please use an explicit cast.
>>>
>>> If I do this cast:
>>>
>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as maxreqtime;
>>>
>>> I get another error:
>>>
>>> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052:
>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Grig
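
A footnote on the two compile-time errors quoted above. ERROR 1052 is Pig saying the cast landed on the wrong thing: inside that FOREACH, FILTERED_DATE_URL.reqtime is a bag of single-field tuples, so (float) is being asked to cast the whole bag, which Pig cannot do. ERROR 1045 likely comes from piggybank's math.MAX being a two-argument scalar function (a port of Math.max) rather than an aggregate, so none of its signatures accept a single bag. Combining Dmitriy's suggestion (builtin MAX, no piggybank) with an explicit cast on the scalar field before the GROUP gives a minimal sketch (untested, same aliases as the script above):

-- cast the field while it is still a scalar, before grouping
DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, (float)reqtime AS reqtime;
FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
-- builtin MAX (no DEFINE needed) resolves on the bag's float field, so no cast here
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;
STORE AGGREGATES INTO '$OUTPUT';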
