Charles,

If you are iterating through a relation, you don't need to refer to it in the
statement. That is,

    C = FILTER B BY valid(B.url);

should be

    C = FILTER B BY valid(url);

(inside the statement you already have access to the rows, not to the
relation B).

The error you are getting comes from a new feature that allows you to pretend
that some relation is a scalar and use that scalar value transparently when
iterating over another relation, e.g.:

    total = foreach (group stuff all) generate COUNT($1) as cnt;
    percent = foreach (group stuff by type) generate COUNT($1) / total.cnt;

Here, I am using the "total" relation as a single-row relation, essentially
promising Pig that total.cnt is only a single value. In your case you are
doing that to a multi-row relation, and things blow up.

D

On Thu, Feb 10, 2011 at 5:42 PM, Charles Gonçalves <[email protected]> wrote:

> I'm trying just to do a breakdown for all my logs but every time I use an
> operation like :
> FILTER alias BY some_udf(alias);
> I got a problem.
>
> First I got : ERROR 0: Scalar has more than one row in the output. :
>
> cfgmc@phoebe:~/workspace-java/MscPigScripts/scripts (121) 23:11:16
> scripts:> pig -x local
> grunt> REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
> grunt>
> grunt> -- Functions Definitions
> grunt> DEFINE EdgeLoader msc.pig.EdgeLoader();
> grunt> DEFINE valid msc.pig.IsValidUrl();
> grunt> raw = LOAD '../inputTestes/wpc_sample.gz' using EdgeLoader;
> grunt> Describe raw
> raw: {ts: long,timeTaken: int,cIp: chararray,fSize: long,sIp: chararray,sPort: int,scStatus: chararray,scBytes: long,csMethod: chararray,url: chararray,rsDuration: int,rsBytes: int,referrer: chararray,ua: chararray,edgeId: chararray}
> grunt> B = FOREACH raw GENERATE cIp,url ;
> grunt> describe B;
> B: {cIp: chararray,url: chararray}
> grunt> *C = FILTER B BY valid(B.url);*
> grunt> describe C;
> C: {cIp: chararray,url: chararray}
> grunt> D = GROUP C BY B.cIp;
> grunt> describe D;
> D: {group: chararray,C: {cIp: chararray,url: chararray}}
> grunt> urls_ok = FOREACH D GENERATE COUNT(C.url);
> grunt> describe urls_ok;
> urls_ok: {long}
> grunt> dump urls_ok;
>
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output.
> 1st : (187.113.41.93, http://webcast.sambatech.com.br/000482/account/8/3/ed92827f3e722bfbbabf89aa4adb0068/ER7_FA_3009_CARRASCONANYDIF_470kbps_2010-09-30.mp4),
> 2nd :(186.213.248.23, http://webcast.sambatech.com.br/000482/account/8/3/thumbnail/media/ea41d211f4e277821cb3e9fd392a51cf/R7_CH_TINAROMA_EMAILR7FAZENDA_470kbps_2010-09-140.03426408348605037.jpg)
> at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:89)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:325)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:169)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:212)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:289)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:127)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:240)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> Then I got :
>
> grunt> REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
> grunt> DEFINE EdgeLoader msc.pig.EdgeLoader();
> grunt> DEFINE valid msc.pig.IsValidUrl();
> grunt> raw = LOAD '../inputTestes/wpc_sample.gz' using EdgeLoader;
> grunt> B = FOREACH raw GENERATE cIp, sIp, sPort, scStatus, csMethod, scBytes, url ;
> grunt> describe B;
> B: {cIp: chararray,sIp: chararray,sPort: int,scStatus: chararray,csMethod: chararray,scBytes: long,url: chararray}
> grunt> E = GROUP B ALL ;
> grunt> describe E;
> E: {group: chararray,B: {cIp: chararray,sIp: chararray,sPort: int,scStatus: chararray,csMethod: chararray,scBytes: long,url: chararray}}
>
> grunt> edge_breakdown = FOREACH E {
> >> dist_cIps = DISTINCT B.cIp;
> >> dist_sIps = DISTINCT B.sIp;
> >> *urls_ok = FILTER B BY valid(B.url);*
> >> GENERATE COUNT(dist_cIps),COUNT(dist_sIps) ,COUNT(urls_ok.url), COUNT(B.url), SUM(B.scBytes);
> >> }
> grunt> DESC
>
> DESC DESCRIBE
> grunt> DESCRIBE edge_breakdown;
> 2011-02-10 23:36:35,274 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 2011-02-10 23:36:35,301 [main] ERROR org.apache.pig.impl.plan.OperatorPlan - Attempt to connect operator urls_ok: Filter 1-196 which is not in the plan.
> 2011-02-10 23:36:35,302 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2219: Unable to process scalar in the plan
> Details at logfile: /home/speed/cfgmc/workspace-java/MscPigScripts/scripts/pig_1297388063472.log
> grunt>
>
> The log file says:
>
> Pig Stack Trace
> ---------------
> ERROR 2219: Unable to process scalar in the plan
>
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias edge_breakdown
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:653)
> at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:236)
> at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:315)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
> at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
> at org.apache.pig.Main.run(Main.java:465)
> at org.apache.pig.Main.main(Main.java:107)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2219: Unable to process scalar in the plan
> at org.apache.pig.PigServer.mergeScalars(PigServer.java:1299)
> at org.apache.pig.PigServer.compileLp(PigServer.java:1304)
> at org.apache.pig.PigServer.compileLp(PigServer.java:1241)
> at org.apache.pig.PigServer.dumpSchema(PigServer.java:639)
> ... 7 more
> Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to connect operator urls_ok: Filter 1-196 which is not in the plan.
> at org.apache.pig.impl.plan.OperatorPlan.checkInPlan(OperatorPlan.java:409)
> at org.apache.pig.impl.plan.OperatorPlan.createSoftLink(OperatorPlan.java:210)
> at org.apache.pig.PigServer.mergeScalars(PigServer.java:1294)
> ... 10 more
> ================================================================================
>
> If I run the last script without the Filter inside the inner foreach it
> works perfectly. The udf is used perfectly in other contexts and works fine.
>
>
>
> Guys, seriously, what am I missing here?
> I got stuck all day on this issue!
>
>
> --
> *Charles Ferreira Gonçalves *
> http://homepages.dcc.ufmg.br/~charles/
> UFMG - ICEx - Dcc
> Cel.: 55 31 87741485
> Tel.: 55 31 34741485
> Lab.: 55 31 34095840
>
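
For reference, here is a rough sketch of how the two scripts from the quoted message might look once the fields are referenced directly instead of through the relation name. It has not been run against the original data; the jar path, loader, and UDF names are copied from the quoted session. Changing GROUP C BY B.cIp to GROUP C BY cIp is an assumption on my part: the advice above only mentions the FILTER, but that statement looks like the same relation-as-scalar pattern.

```pig
REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
DEFINE EdgeLoader msc.pig.EdgeLoader();
DEFINE valid msc.pig.IsValidUrl();

raw = LOAD '../inputTestes/wpc_sample.gz' USING EdgeLoader;

-- First script: count valid URLs per client IP.
B1 = FOREACH raw GENERATE cIp, url;
C  = FILTER B1 BY valid(url);      -- field name only, not B1.url
D  = GROUP C BY cIp;               -- assumption: same fix applied to the group key
urls_ok = FOREACH D GENERATE COUNT(C.url);  -- C.url is fine here: C is a bag inside D
DUMP urls_ok;

-- Second script: overall breakdown with a nested FILTER.
B2 = FOREACH raw GENERATE cIp, sIp, sPort, scStatus, csMethod, scBytes, url;
E  = GROUP B2 ALL;
edge_breakdown = FOREACH E {
    dist_cIps = DISTINCT B2.cIp;
    dist_sIps = DISTINCT B2.sIp;
    -- Inside the nested block, B2 is the bag being filtered, so the
    -- predicate again names the field directly rather than B2.url.
    valid_urls = FILTER B2 BY valid(url);
    GENERATE COUNT(dist_cIps), COUNT(dist_sIps), COUNT(valid_urls.url),
             COUNT(B2.url), SUM(B2.scBytes);
}
DESCRIBE edge_breakdown;
```

The aliases B1, B2, and valid_urls are renamed here only so both scripts can sit in one file without clashing; the original session used B and urls_ok in both.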
