Thanks, Jonathan, that's good to know.
On Tue, Dec 20, 2011 at 5:25 PM, Jonathan Coveney <[email protected]> wrote:
> Grig, just for context on Dmitriy's help: declaring a :type in a schema
> does not, generally, actually coerce the value. There is a JIRA about the
> behavior of :float, for example, vs. (float), but in cases where a UDF
> returns a chararray and you need to convert it, the safest bet is to cast
> it explicitly.
>
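To make the distinction concrete, here is a minimal sketch (it assumes the
piggybank REGISTER/DEFINE of EXTRACT from the script at the bottom of the
thread; the pattern and field names are illustrative placeholders):

    -- The AS clause only declares the schema Pig should expect; if the
    -- UDF actually emits a chararray, the value is still a String at runtime:
    RAW_LOGS = LOAD '$INPUT' AS (line: chararray);
    LOGS_BASE = FOREACH RAW_LOGS GENERATE
        FLATTEN(EXTRACT(line, '(\\S+) (\\S+)'))  -- placeholder pattern
        AS (url: chararray, reqtime: float);

    -- An explicit cast actually converts the value to a Float:
    TYPED = FOREACH LOGS_BASE GENERATE url, (float) reqtime AS reqtime;
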
> 2011/12/20 Grig Gheorghiu <[email protected]>
>
>> That made it work! Thanks so much, Dmitriy, you rock!!!
>>
>> Grig
>>
>> On Tue, Dec 20, 2011 at 4:11 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> > Sounds like when the regex returns matched strings, even though you say
>> > that the time should be a float, it's actually a String.
>> >
>> > Try
>> >
>> > foo = foreach LOGS_BASE generate field1, field2, (float) reqtime;
>> >
>> > then group / max on foo.
>> >
>> > D
>> >
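Spelled out, the "group / max" step Dmitriy describes might look like this
(a sketch; field names follow his placeholder example above):

    grouped = GROUP foo BY (field1, field2);
    maxed = FOREACH grouped GENERATE
        FLATTEN(group) AS (field1, field2),
        MAX(foo.reqtime) AS maxreqtime;
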
>> > On Tue, Dec 20, 2011 at 3:58 PM, Grig Gheorghiu <[email protected]> wrote:
>> >
>> >> Here it is
>> >>
>> >> Backend error message
>> >> ---------------------
>> >> org.apache.pig.backend.executionengine.ExecException: ERROR 2106:
>> >> Error while computing max in Initial
>> >>         at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:84)
>> >>         at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:64)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>> >>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
>> >>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
>> >>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
>> >>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>> >>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>> >>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
>> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
>> >>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >>         at javax.security.auth.Subject.doAs(Subject.java:396)
>> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> >>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
>> >>         at org.apache.pig.builtin.FloatMax$Initial.exec(FloatMax.java:76)
>> >>         ... 19 more
>> >>
>> >> Pig Stack Trace
>> >> ---------------
>> >> ERROR 2997: Unable to recreate exception from backed error:
>> >> org.apache.pig.backend.executionengine.ExecException: ERROR 2106:
>> >> Error while computing max in Initial
>> >>
>> >> org.apache.pig.backend.executionengine.ExecException: ERROR 2997:
>> >> Unable to recreate exception from backed error:
>> >> org.apache.pig.backend.executionengine.ExecException: ERROR 2106:
>> >> Error while computing max in Initial
>> >>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
>> >>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
>> >>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:343)
>> >>         at org.apache.pig.PigServer.launchPlan(PigServer.java:1314)
>> >>         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1299)
>> >>         at org.apache.pig.PigServer.execute(PigServer.java:1286)
>> >>
>> >> On Tue, Dec 20, 2011 at 2:44 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> >> > Can you take a look at the failed map tasks and paste their stack traces?
>> >> >
>> >> > On Tue, Dec 20, 2011 at 2:40 PM, Grig Gheorghiu <[email protected]> wrote:
>> >> >> Thanks Dmitriy! I made some progress by using MAX without the
>> >> >> piggybank declaration, but I still got an error:
>> >> >>
>> >> >> 2011-12-20 22:35:25,832 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pigscripts/pig_1324420525827.log
>> >> >> 2011-12-20 22:35:26,139 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.110.209.25:9000
>> >> >> 2011-12-20 22:35:26,615 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.110.209.25:9001
>> >> >> 2011-12-20 22:35:27,815 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
>> >> >> 2011-12-20 22:35:29,109 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>> >> >> 2011-12-20 22:35:29,121 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
>> >> >> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>> >> >> 2011-12-20 22:35:29,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>> >> >> 2011-12-20 22:35:29,293 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
>> >> >> 2011-12-20 22:35:29,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>> >> >> 2011-12-20 22:35:37,231 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>> >> >> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
>> >> >> 2011-12-20 22:35:37,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
>> >> >> 2011-12-20 22:35:37,328 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>> >> >> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
>> >> >> 2011-12-20 22:35:37,345 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 32
>> >> >> 2011-12-20 22:35:37,346 [Thread-5] INFO org.apache.hadoop.mapred.JobClient - Default number of reduce tasks: 1
>> >> >> 2011-12-20 22:35:37,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>> >> >> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>> >> >> 2011-12-20 22:35:38,083 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>> >> >> 2011-12-20 22:35:38,093 [Thread-5] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
>> >> >> 2011-12-20 22:35:38,095 [Thread-5] WARN com.hadoop.compression.lzo.LzoCodec - Could not find build properties file with revision hash
>> >> >> 2011-12-20 22:35:38,095 [Thread-5] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
>> >> >> 2011-12-20 22:35:38,102 [Thread-5] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
>> >> >> 2011-12-20 22:35:38,102 [Thread-5] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
>> >> >> 2011-12-20 22:35:38,105 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
>> >> >> 2011-12-20 22:35:38,921 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201112192006_0026
>> >> >> 2011-12-20 22:35:38,922 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://10.110.209.25:9100/jobdetails.jsp?jobid=job_201112192006_0026
>> >> >> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201112192006_0026 has failed! Stop running all dependent jobs
>> >> >> 2011-12-20 22:36:28,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>> >> >> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>> >> >> 2011-12-20 22:36:28,633 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>> >> >> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>> >> >>
>> >> >> HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
>> >> >> 0.20.205        0.9.1-amzn   hadoop   2011-12-20 22:35:29   2011-12-20 22:36:28   GROUP_BY,FILTER
>> >> >>
>> >> >> Failed!
>> >> >>
>> >> >> Failed Jobs:
>> >> >> JobId   Alias   Feature   Message   Outputs
>> >> >> job_201112192006_0026   AGGREGATES,DATE_URL,FILTERED_DATE_URL,GROUP_BY_DATE_URL,LOGS_BASE,RAW_LOGS   GROUP_BY,COMBINER   Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201112192006_0026_m_000000   s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235,
>> >> >>
>> >> >> Input(s):
>> >> >> Failed to read data from "s3://mapreduce.bucket/nginx/*test*.gz"
>> >> >>
>> >> >> Output(s):
>> >> >> Failed to produce result in "s3://mapreduce.bucket/nginx/pigoutput/nginx_201112202235"
>> >> >>
>> >> >> Counters:
>> >> >> Total records written : 0
>> >> >> Total bytes written : 0
>> >> >> Spillable Memory Manager spill count : 0
>> >> >> Total bags proactively spilled: 0
>> >> >> Total records proactively spilled: 0
>> >> >>
>> >> >> Job DAG:
>> >> >> job_201112192006_0026
>> >> >>
>> >> >> 2011-12-20 22:36:28,635 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
>> >> >> 2011-12-20 22:36:29,181 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>> >> >> Details at logfile: /home/hadoop/pigscripts/pig_1324420525827.log
>> >> >>
>> >> >> On Tue, Dec 20, 2011 at 2:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> >> >>> Try just using MAX (no piggybank). If that doesn't work, try FloatMax
>> >> >>> (still no piggybank).
>> >> >>>
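In script form, that suggestion amounts to dropping the piggybank DEFINE
and letting Pig's builtin MAX pick a typed implementation (FloatMax, in
this case) from the input schema. A sketch, reusing the aliases from the
script below; note that piggybank's math.MAX is a scalar function, while
the builtin MAX aggregates a bag:

    -- no "DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();"
    -- needed; the builtin MAX resolves by schema:
    AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
        FLATTEN(group) AS (year, month, day, url),
        MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;
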
>> >> >>> On Tue, Dec 20, 2011 at 2:19 PM, Grig Gheorghiu <[email protected]> wrote:
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> Noob here. I am trying to analyze some Nginx log files and get some
>> >> >>>> aggregate stats based on date and URL. Here is the beginning of a Pig
>> >> >>>> script I have (I am running this in Elastic MapReduce, with Pig 0.9.1):
>> >> >>>>
>> >> >>>> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
>> >> >>>> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
>> >> >>>> DEFINE MAX org.apache.pig.piggybank.evaluation.math.MAX();
>> >> >>>> RAW_LOGS = LOAD '$INPUT' as (line:chararray);
>> >> >>>> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>> >> >>>> FLATTEN(
>> >> >>>>   EXTRACT(line, '(\\S+) - - \\[([^/]+)\\/([^/]+)\\/(\\d+):(\\d+):(\\d+):(\\d+) ([+-]\\d+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
>> >> >>>> )
>> >> >>>> AS (
>> >> >>>>   ip: chararray,
>> >> >>>>   day: chararray,
>> >> >>>>   month: chararray,
>> >> >>>>   year: chararray,
>> >> >>>>   hour: chararray,
>> >> >>>>   minute: chararray,
>> >> >>>>   second: chararray,
>> >> >>>>   tzoffset: chararray,
>> >> >>>>   url: chararray,
>> >> >>>>   status: chararray,
>> >> >>>>   bytes: chararray,
>> >> >>>>   referrer: chararray,
>> >> >>>>   useragent: chararray,
>> >> >>>>   xfwd: chararray,
>> >> >>>>   reqtime: float
>> >> >>>> );
>> >> >>>> DATE_URL = FOREACH LOGS_BASE GENERATE year, month, day, url, reqtime;
>> >> >>>> FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
>> >> >>>> GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
>> >> >>>>
>> >> >>>> Now I would like to get the MAX of the request time. If I do this:
>> >> >>>>
>> >> >>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
>> >> >>>> (year, month, day, url), MAX(FILTERED_DATE_URL.reqtime) as maxreqtime;
>> >> >>>> STORE AGGREGATES INTO '$OUTPUT';
>> >> >>>>
>> >> >>>> I get this error:
>> >> >>>>
>> >> >>>> 2011-12-20 22:16:32,147 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
>> >> >>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 91> Could not infer the matching function for org.apache.pig.piggybank.evaluation.math.MAX as multiple or none of them fit. Please use an explicit cast.
>> >> >>>>
>> >> >>>> If I do this cast:
>> >> >>>>
>> >> >>>> AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE FLATTEN(group) as
>> >> >>>> (year, month, day, url), MAX((float)FILTERED_DATE_URL.reqtime) as maxreqtime;
>> >> >>>>
>> >> >>>> I get another error:
>> >> >>>>
>> >> >>>> 2011-12-20 22:18:35,115 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052:
>> >> >>>> <file /home/hadoop/pigscripts/nginx_access_log.pig, line 31, column 96> Cannot cast bag with schema :bag{:tuple(reqtime:float)} to float
>> >> >>>>
>> >> >>>> Any help would be greatly appreciated.
>> >> >>>>
>> >> >>>> Grig
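Putting the upthread fixes together, a sketch of the working tail of the
script: cast reqtime at the row level before grouping (rather than casting
the bag inside MAX, which caused ERROR 1052), and use the builtin MAX
instead of the piggybank DEFINE. Only the cast and the dropped DEFINE
differ from the original:

    DATE_URL = FOREACH LOGS_BASE GENERATE
        year, month, day, url, (float) reqtime AS reqtime;
    FILTERED_DATE_URL = FILTER DATE_URL BY NOT url IS NULL;
    GROUP_BY_DATE_URL = GROUP FILTERED_DATE_URL BY (year, month, day, url);
    AGGREGATES = FOREACH GROUP_BY_DATE_URL GENERATE
        FLATTEN(group) AS (year, month, day, url),
        MAX(FILTERED_DATE_URL.reqtime) AS maxreqtime;
    STORE AGGREGATES INTO '$OUTPUT';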
