Thanks Dmitriy, that works ...
On Fri, Feb 11, 2011 at 12:06 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Charles,
> If you are iterating through a relation, you don't need to refer to it
> in the statement.
>
> Meaning:
>
>     C = FILTER B BY valid(B.url);
>
> should be
>
>     C = FILTER B BY valid(url);
>
> (you already have access to the rows, not to the relation B).
>
> The error you are getting is from a new feature that allows you to
> pretend that some relation is a scalar and use that scalar value
> transparently when iterating over another relation, e.g.:
>
>     total = foreach (group stuff all) generate COUNT($1) as cnt;
>     percent = foreach (group stuff by type) generate COUNT($1) / total.cnt;
>
> Here, I am using the "total" relation as a single-row relation,
> essentially promising Pig that total.cnt is only a single value.
> In your case you are doing that to a multi-row relation, and things blow up.
>
> D
>
> On Thu, Feb 10, 2011 at 5:42 PM, Charles Gonçalves <[email protected]> wrote:
> > I'm trying to do a breakdown of all my logs, but every time I use an
> > operation like:
> >
> >     FILTER alias BY some_udf(alias);
> >
> > I get a problem.
> >
> > First I got: ERROR 0: Scalar has more than one row in the output.
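The distinction Dmitriy describes can be sketched in a few lines of Pig Latin (the aliases mirror the ones used in this thread; `valid` is the poster's own UDF):

```pig
-- Per-row use: inside FILTER (or FOREACH) you already operate on rows of B,
-- so project the field directly rather than through the relation name.
C = FILTER B BY valid(url);          -- correct: url is a field of each row
-- C = FILTER B BY valid(B.url);     -- wrong here: B.url asks Pig to read B as a scalar

-- Scalar use: referencing another relation's field is only legal when that
-- relation is guaranteed to hold exactly one row.
total   = FOREACH (GROUP stuff ALL) GENERATE COUNT($1) AS cnt;        -- one row
percent = FOREACH (GROUP stuff BY type) GENERATE COUNT($1) / total.cnt;
```

If the relation referenced as a scalar turns out to have more than one row at runtime, Pig raises exactly the "Scalar has more than one row in the output" error shown below.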
> >
> > cfgmc@phoebe:~/workspace-java/MscPigScripts/scripts (121) 23:11:16
> > scripts:> pig -x local
> > grunt> REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
> >
> > grunt> -- Functions Definitions
> > grunt> DEFINE EdgeLoader msc.pig.EdgeLoader();
> > grunt> DEFINE valid msc.pig.IsValidUrl();
> > grunt> raw = LOAD '../inputTestes/wpc_sample.gz' using EdgeLoader;
> > grunt> describe raw;
> > raw: {ts: long, timeTaken: int, cIp: chararray, fSize: long, sIp: chararray, sPort: int, scStatus: chararray, scBytes: long, csMethod: chararray, url: chararray, rsDuration: int, rsBytes: int, referrer: chararray, ua: chararray, edgeId: chararray}
> > grunt> B = FOREACH raw GENERATE cIp, url;
> > grunt> describe B;
> > B: {cIp: chararray, url: chararray}
> > grunt> *C = FILTER B BY valid(B.url);*
> > grunt> describe C;
> > C: {cIp: chararray, url: chararray}
> > grunt> D = GROUP C BY B.cIp;
> > grunt> describe D;
> > D: {group: chararray, C: {cIp: chararray, url: chararray}}
> > grunt> urls_ok = FOREACH D GENERATE COUNT(C.url);
> > grunt> describe urls_ok;
> > urls_ok: {long}
> > grunt> dump urls_ok;
> >
> > org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output.
> > 1st : (187.113.41.93, http://webcast.sambatech.com.br/000482/account/8/3/ed92827f3e722bfbbabf89aa4adb0068/ER7_FA_3009_CARRASCONANYDIF_470kbps_2010-09-30.mp4),
> > 2nd : (186.213.248.23, http://webcast.sambatech.com.br/000482/account/8/3/thumbnail/media/ea41d211f4e277821cb3e9fd392a51cf/R7_CH_TINAROMA_EMAILR7FAZENDA_470kbps_2010-09-140.03426408348605037.jpg)
> >     at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:89)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:325)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:169)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:212)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:289)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:127)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:240)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >
> > Then I got:
> >
> > grunt> REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
> > grunt> DEFINE EdgeLoader msc.pig.EdgeLoader();
> > grunt> DEFINE valid msc.pig.IsValidUrl();
> > grunt> raw = LOAD '../inputTestes/wpc_sample.gz' using EdgeLoader;
> > grunt> B = FOREACH raw GENERATE cIp, sIp, sPort, scStatus, csMethod, scBytes, url;
> > grunt> describe B;
> > B: {cIp: chararray, sIp: chararray, sPort: int, scStatus: chararray, csMethod: chararray, scBytes: long, url: chararray}
> > grunt> E = GROUP B ALL;
> > grunt> describe E;
> > E: {group: chararray, B: {cIp: chararray, sIp: chararray, sPort: int, scStatus: chararray, csMethod: chararray, scBytes: long, url: chararray}}
> >
> > grunt> edge_breakdown = FOREACH E {
> >>>   dist_cIps = DISTINCT B.cIp;
> >>>   dist_sIps = DISTINCT B.sIp;
> >>>   *urls_ok = FILTER B BY valid(B.url);*
> >>>   GENERATE COUNT(dist_cIps), COUNT(dist_sIps), COUNT(urls_ok.url), COUNT(B.url), SUM(B.scBytes);
> >>> }
> > grunt> DESC
> >
> > DESC     DESCRIBE
> > grunt> DESCRIBE edge_breakdown;
> > 2011-02-10 23:36:35,274 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> > 2011-02-10 23:36:35,301 [main] ERROR org.apache.pig.impl.plan.OperatorPlan - Attempt to connect operator urls_ok: Filter 1-196 which is not in the plan.
> > 2011-02-10 23:36:35,302 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2219: Unable to process scalar in the plan
> > Details at logfile: /home/speed/cfgmc/workspace-java/MscPigScripts/scripts/pig_1297388063472.log
> > grunt>
> >
> > The log file says:
> >
> > Pig Stack Trace
> > ---------------
> > ERROR 2219: Unable to process scalar in the plan
> >
> > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias edge_breakdown
> >     at org.apache.pig.PigServer.dumpSchema(PigServer.java:653)
> >     at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:236)
> >     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:315)
> >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
> >     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
> >     at org.apache.pig.Main.run(Main.java:465)
> >     at org.apache.pig.Main.main(Main.java:107)
> > Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2219: Unable to process scalar in the plan
> >     at org.apache.pig.PigServer.mergeScalars(PigServer.java:1299)
> >     at org.apache.pig.PigServer.compileLp(PigServer.java:1304)
> >     at org.apache.pig.PigServer.compileLp(PigServer.java:1241)
> >     at org.apache.pig.PigServer.dumpSchema(PigServer.java:639)
> >     ... 7 more
> > Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to connect operator urls_ok: Filter 1-196 which is not in the plan.
> >     at org.apache.pig.impl.plan.OperatorPlan.checkInPlan(OperatorPlan.java:409)
> >     at org.apache.pig.impl.plan.OperatorPlan.createSoftLink(OperatorPlan.java:210)
> >     at org.apache.pig.PigServer.mergeScalars(PigServer.java:1294)
> >     ... 10 more
> > ================================================================================
> >
> > If I run the last script without the FILTER inside the inner FOREACH it
> > works perfectly. The UDF is used in other contexts and works fine.
> >
> > Guys, seriously, what am I missing here?
> > I've been stuck all day on this issue!
> >
> > --
> > *Charles Ferreira Gonçalves *
> > http://homepages.dcc.ufmg.br/~charles/
> > UFMG - ICEx - Dcc
> > Cel.: 55 31 87741485
> > Tel.: 55 31 34741485
> > Lab.: 55 31 34095840

--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840
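Putting Dmitriy's two points together, one possible rearrangement of the failing breakdown script is to move the FILTER out of the nested FOREACH block (which sidesteps the ERROR 2219 plan problem) and to reference fields, not the relation, inside the UDF call. This is only a sketch assuming the same aliases and UDFs as in the thread; it computes the valid-URL count from a separately grouped, pre-filtered relation:

```pig
REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
DEFINE EdgeLoader msc.pig.EdgeLoader();
DEFINE valid msc.pig.IsValidUrl();

raw     = LOAD '../inputTestes/wpc_sample.gz' USING EdgeLoader;
B       = FOREACH raw GENERATE cIp, sIp, sPort, scStatus, csMethod, scBytes, url;

-- Filter at the top level, before grouping, and pass the field (url),
-- not the relation (B.url), to the UDF.
B_valid = FILTER B BY valid(url);

E       = GROUP B ALL;
Ev      = GROUP B_valid ALL;

-- The DISTINCTs stay in the nested block (the poster reports this part
-- works once the inner FILTER is removed).
edge_breakdown = FOREACH E {
    dist_cIps = DISTINCT B.cIp;
    dist_sIps = DISTINCT B.sIp;
    GENERATE COUNT(dist_cIps), COUNT(dist_sIps), COUNT(B.url), SUM(B.scBytes);
}
valid_urls = FOREACH Ev GENERATE COUNT(B_valid.url);
```

The counts end up in two relations instead of one row; if a single row is needed, they could be combined afterwards (e.g. with CROSS, since both relations have exactly one row).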
