Can you take a look at the failed map tasks and paste their stack traces?
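
While you're pulling those up, one thing worth checking on the 2106: piggybank's EXTRACT returns its match groups as chararrays, and in Pig 0.9 the AS clause on a FLATTEN of a UDF result largely asserts a schema rather than forcing a conversion. If that's what is happening, the builtin MAX resolves to its float implementation from the declared type but then receives strings at runtime, and fails inside the combiner in exactly this way ("computing max in Initial"). A hedged, untested sketch of a workaround, reusing the aliases from your script: declare reqtime as chararray in LOGS_BASE, then cast it explicitly one step later so the conversion really happens:

-- reqtime: chararray in LOGS_BASE's AS clause, then:
-- the explicit cast converts the string for real; the type in AS is only a declaration
DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, (float)reqtime AS reqtime;

DESCRIBE only shows the declared schema, so it will look right either way; if this is the cause, the failed task logs should show the underlying ClassCastException.
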
On Tue, Dec 20, 2011 at 2:40 PM, Grig Gheorghiu <[email protected]> wrote:
> Thanks Dmitriy! I made some progress by using MAX without the piggybank declaration, but I still got an error:
>
> 2011-12-20 22:35:25,832 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pigscripts/pig_1324420525827.log
> 2011-12-20 22:35:26,139 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.110.209.25:9000
> 2011-12-20 22:35:26,615 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.110.209.25:9001
> 2011-12-20 22:35:27,815 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
> 2011-12-20 22:35:29,109 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2011-12-20 22:35:29,121 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2011-12-20 22:35:29,293 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2011-12-20 22:35:29,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2011-12-20 22:35:37,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
> 2011-12-20 22:35:37,328 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 32
> 2011-12-20 22:35:37,346 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of reduce tasks: 1
> 2011-12-20 22:35:37,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2011-12-20 22:35:38,093 [Thread-5] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
> 2011-12-20 22:35:38,095 [Thread-5] WARN com.hadoop.compression.lzo.LzoCodec - Could not find build properties file with revision hash
> 2011-12-20 22:35:38,095 [Thread-5] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
> 2011-12-20 22:35:38,102 [Thread-5] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
> 2011-12-20 22:35:38,102 [Thread-5] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
> 2011-12-20 22:35:38,105 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> 2011-12-20 22:35:38,921 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201112192006_0026
> 2011-12-20 22:35:38,922 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://10.110.209.25:9100/jobdetails.jsp?jobid=job_201112192006_0026
> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201112192006_0026 has failed! Stop running all dependent jobs
> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
> 0.20.205        0.9.1-amzn   hadoop   2011-12-20 22:35:29   2011-12-20 22:36:28   GROUP_BY,FILTER
>
> Failed!
>
> Failed Jobs:
> JobId   Alias   Feature   Message   Outputs
> job_201112192006_0026   AGGREGATES,DATE_URL,FILTERED_DATE_URL,GROUP_BY_DATE_URL,LOGS_BASE,RAW_LOGS   GROUP_BY,COMBINER   Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
> LastFailedTask: task_201112192006_0026_m_000000   s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235,
>
> Input(s):
> Failed to read data from "s3://mapreduce.bucket/nginx/*test*.gz"
>
> Output(s):
> Failed to produce result in "s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201112192006_0026
>
> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2011-12-20 22:36:29,181 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> Details at logfile: /home/hadoop/pigscripts/pig_1324420525827.log
>
> On Tue, Dec 20, 2011 at 2:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> Try just using MAX (no piggybank). If that doesn't work, try FloatMax (still no piggybank).
>>
>> On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu <[email protected]> wrote:
>>> Hello,
>>>
>>> Noob here. I am trying to analyze some Nginx log files and get some aggregate stats based on date and URL. Here is the beginning of a Pig script I have (I am running this in Elastic MapReduce, with Pig 0.9.1):
>>>
>>> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
>>> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
>>> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
>>> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
>>> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>     FLATTEN(
>>>         EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
>>>     )
>>>     AS (
>>>         ip: chararray,
>>>         day: chararray,
>>>         month: chararray,
>>>         year: chararray,
>>>         hour: chararray,
>>>         minute: chararray,
>>>         second: chararray,
>>>         tzoffset: chararray,
>>>         url: chararray,
>>>         status: chararray,
>>>         bytes: chararray,
>>>         referrer: chararray,
>>>         useragent: chararray,
>>>         xfwd: chararray,
>>>         reqtime: float
>>>     );
>>> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
>>> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
>>> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
>>>
>>> Now I would like to get the MAX of the request time. If I do this:
>>>
>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
>>> STORE AGGREGATES INTO '$OUTPUT';
>>>
>>> I get this error:
>>>
>>> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 91> Could not infer the matching function for org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of them fit. Please use an explicit cast.
>>>
>>> If I do this cast:
>>>
>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as maxreqtime;
>>>
>>> I get another error:
>>>
>>> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052:
>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Grig
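
A footnote on the two compile-time errors quoted above. ERROR 1052 is Pig saying the cast landed on the wrong thing: inside that FOREACH, FILTERED_DATE_URL.reqtime is a bag of single-field tuples, so (float) is being asked to cast the whole bag, which Pig cannot do. ERROR 1045 likely comes from piggybank's math.MAX being a two-argument scalar function (a port of Math.max) rather than an aggregate, so none of its signatures accept a single bag. Combining Dmitriy's suggestion (builtin MAX, no piggybank) with an explicit cast on the scalar field before the GROUP gives a minimal sketch (untested, same aliases as the script above):

-- cast the field while it is still a scalar, before grouping
DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, (float)reqtime AS reqtime;
FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
-- builtin MAX (no DEFINE needed) resolves on the bag's float field, so no cast here
AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;
STORE AGGREGATES INTO '$OUTPUT';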
