Re: Problems with aggregate functions

Jonathan Coveney Tue, 20 Dec 2011 17:25:51 -0800

Grig, just for context on Dmitriy's help: declaring a field as :type in a
schema does not, in general, actually convert the value to that type. There
is a JIRA about how :float behaves compared with (float), but in cases
where, for example, a UDF returns a chararray and you need to convert it,
the safest bet is to cast it explicitly.
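
A minimal sketch of that approach, assuming LOGS_BASE is built as in Grig's
script except that its AS clause declares reqtime as chararray (which is
what EXTRACT returns), and that the piggybank DEFINE MAX line has been
dropped so that MAX resolves to the builtin. The relation names LOGS_TYPED,
FILTERED and GROUPED are made up for illustration:

```pig
-- The explicit cast converts the chararray value to a float; a schema
-- declaration alone would only label the field.
LOGS_TYPED = FOREACH LOGS_BASE GENERATE
    year, month, day, url,
    (float) reqtime AS reqtime;

FILTERED = FILTER LOGS_TYPED BY url IS NOT NULL;
GROUPED  = GROUP FILTERED BY (year, month, day, url);

-- Builtin MAX applied to the bag of float values in each group.
AGGREGATES = FOREACH GROUPED GENERATE
    FLATTEN(group) AS (year, month, day, url),
    MAX(FILTERED.reqtime) AS maxreqtime;
```

Note that the cast is applied to the scalar field before grouping; casting
the bag inside MAX(...) is what produced ERROR 1052 earlier in the thread.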


2011/12/20 Grig Gheorghiu <[email protected]>

> That made it work! Thanks so much, Dmitriy, you rock!!!
>
> Grig
>
> On Tue, Dec 20, 2011 at 4:11 PM, Dmitriy Ryaboy <[email protected]>
> wrote:
> > Sounds like when the regex returns matched strings, even though you say
> > that the time should be a float, it's actually a String.
> >
> > Try
> >
> > foo = foreach LOGS_BASE generate field1, field2, (float) reqtime;
> >
> > then group / max on foo.
> >
> > D
> >
> >
> > On Tue, Dec 20, 2011 at 3:58 PM, Grig Gheorghiu <[email protected]> wrote:
> >
> >> Here it is
> >>
> >> Backend error message
> >> ---------------------
> >> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >>        at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:84)
> >>        at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:64)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> >>        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> >>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
> >>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> >>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> >>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
> >>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
> >>        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >>        at java.security.AccessController.doPrivileged(Native Method)
> >>        at javax.security.auth.Subject.doAs(Subject.java:396)
> >>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >>        at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
> >>        at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:76)
> >>        ... 19 more
> >>
> >> Pig Stack Trace
> >> ---------------
> >> ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >>
> >> org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
> >>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
> >>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:343)
> >>        at org.apache.pig.PigServer.launchPlan(PigServer.java:1314)
> >>        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1299)
> >>        at org.apache.pig.PigServer.execute(PigServer.java:1286)
> >>
> >> On Tue, Dec 20, 2011 at 2:44 PM, Dmitriy Ryaboy <[email protected]>
> >> wrote:
> >> > Can you take a look at the failed map tasks and paste their stack traces?
> >> >
> >> > On Tue, Dec 20, 2011 at 2:40 PM, Grig Gheorghiu
> >> > <[email protected]> wrote:
> >> >> Thanks Dmitriy! I made some progress by using MAX without the
> >> >> piggybank declaration, but I still got an error:
> >> >>
> >> >> 2011-12-20 22:35:25,832 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pigscripts/pig_1324420525827.log
> >> >> 2011-12-20 22:35:26,139 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.110.209.25:9000
> >> >> 2011-12-20 22:35:26,615 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.110.209.25:9001
> >> >> 2011-12-20 22:35:27,815 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
> >> >> 2011-12-20 22:35:29,109 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> >> >> 2011-12-20 22:35:29,121 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
> >> >> 2011-12-20 22:35:29,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> >> >> 2011-12-20 22:35:29,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> >> >> 2011-12-20 22:35:29,293 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> >> >> 2011-12-20 22:35:29,308 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> >> >> 2011-12-20 22:35:37,231 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >> >> 2011-12-20 22:35:37,276 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
> >> >> 2011-12-20 22:35:37,276 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
> >> >> 2011-12-20 22:35:37,328 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> >> >> 2011-12-20 22:35:37,345 [Thread-5] INFO  org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
> >> >> 2011-12-20 22:35:37,345 [Thread-5] INFO  org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 32
> >> >> 2011-12-20 22:35:37,346 [Thread-5] INFO  org.apache.hadoop.mapred.JobClient - Default number of reduce tasks: 1
> >> >> 2011-12-20 22:35:37,829 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> >> >> 2011-12-20 22:35:38,083 [Thread-5] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> >> >> 2011-12-20 22:35:38,083 [Thread-5] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> >> >> 2011-12-20 22:35:38,093 [Thread-5] INFO  com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
> >> >> 2011-12-20 22:35:38,095 [Thread-5] WARN  com.hadoop.compression.lzo.LzoCodec - Could not find build properties file with revision hash
> >> >> 2011-12-20 22:35:38,095 [Thread-5] INFO  com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
> >> >> 2011-12-20 22:35:38,102 [Thread-5] WARN  org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
> >> >> 2011-12-20 22:35:38,102 [Thread-5] INFO  org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
> >> >> 2011-12-20 22:35:38,105 [Thread-5] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> >> >> 2011-12-20 22:35:38,921 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201112192006_0026
> >> >> 2011-12-20 22:35:38,922 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://10.110.209.25:9100/jobdetails.jsp?jobid=job_201112192006_0026
> >> >> 2011-12-20 22:36:28,616 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201112192006_0026 has failed! Stop running all dependent jobs
> >> >> 2011-12-20 22:36:28,616 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> >> >> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >> >> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> >> >> 2011-12-20 22:36:28,635 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> >> >>
> >> >> HadoopVersion   PigVersion      UserId  StartedAt               FinishedAt              Features
> >> >> 0.20.205        0.9.1-amzn      hadoop  2011-12-20 22:35:29     2011-12-20 22:36:28     GROUP_BY,FILTER
> >> >>
> >> >> Failed!
> >> >>
> >> >> Failed Jobs:
> >> >> JobId   Alias   Feature Message Outputs
> >> >> job_201112192006_0026   AGGREGATES,DATE_URL,FILTERED_DATE_URL,GROUP_BY_DATE_URL,LOGS_BASE,RAW_LOGS   GROUP_BY,COMBINER   Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201112192006_0026_m_000000   s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235,
> >> >>
> >> >> Input(s):
> >> >> Failed to read data from "s3://mapreduce.bucket/nginx/*test*.gz"
> >> >>
> >> >> Output(s):
> >> >> Failed to produce result in
> >> >> "s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235"
> >> >>
> >> >> Counters:
> >> >> Total records written : 0
> >> >> Total bytes written : 0
> >> >> Spillable Memory Manager spill count : 0
> >> >> Total bags proactively spilled: 0
> >> >> Total records proactively spilled: 0
> >> >>
> >> >> Job DAG:
> >> >> job_201112192006_0026
> >> >>
> >> >>
> >> >> 2011-12-20 22:36:28,635 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> >> >> 2011-12-20 22:36:29,181 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> >> >> Details at logfile: /home/hadoop/pigscripts/pig_1324420525827.log
> >> >>
> >> >> On Tue, Dec 20, 2011 at 2:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >> >>> Try just using MAX (no piggybank). If that doesn't work, try FloatMax
> >> >>> (still no piggybank).
> >> >>>
> >> >>>
> >> >>> On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu
> >> >>> <[email protected]> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Noob here. I am trying to analyze some Nginx log files and get some
> >> >>>> aggregate stats based on date and URL. Here is the beginning of a Pig
> >> >>>> script I have (I am running this in Elastic MapReduce, with Pig
> >> >>>> 0.9.1):
> >> >>>>
> >> >>>>
> >> >>>> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
> >> >>>> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
> >> >>>> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
> >> >>>> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
> >> >>>> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> >> >>>>
> >> >>>> FLATTEN(
> >> >>>>        EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
> >> >>>> )
> >> >>>> AS (
> >> >>>>        ip: chararray,
> >> >>>>        day: chararray,
> >> >>>>        month: chararray,
> >> >>>>        year: chararray,
> >> >>>>        hour: chararray,
> >> >>>>        minute: chararray,
> >> >>>>        second: chararray,
> >> >>>>        tzoffset: chararray,
> >> >>>>        url: chararray,
> >> >>>>        status: chararray,
> >> >>>>        bytes: chararray,
> >> >>>>        referrer: chararray,
> >> >>>>        useragent: chararray,
> >> >>>>        xfwd: chararray,
> >> >>>>        reqtime: float
> >> >>>> );
> >> >>>> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
> >> >>>> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
> >> >>>> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
> >> >>>>
> >> >>>> Now I would like to get the MAX of the request time. If I do this:
> >> >>>>
> >> >>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
> >> >>>> (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
> >> >>>> STORE AGGREGATES INTO '$OUTPUT';
> >> >>>>
> >> >>>> I get this error
> >> >>>>
> >> >>>> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
> >> >>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
> >> >>>> 91> Could not infer the matching function for
> >> >>>> org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of
> >> >>>> them fit. Please use an explicit cast.
> >> >>>>
> >> >>>> If I do this cast:
> >> >>>>
> >> >>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
> >> >>>> (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as
> >> >>>> maxreqtime;
> >> >>>>
> >> >>>> I get another error:
> >> >>>>
> >> >>>> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052:
> >> >>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column
> >> >>>> 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
> >> >>>>
> >> >>>> Any help would be greatly appreciated.
> >> >>>>
> >> >>>> Grig
> >>
>
