I'm just trying to do a breakdown of all my logs, but every time I use an
operation like:
FILTER alias BY some_udf(alias);
I run into a problem.
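A minimal sketch of the shape of what I'm doing (the alias, path, and UDF names here are placeholders, not my real ones):

```pig
-- placeholder names, just to illustrate the failing pattern
DEFINE some_udf my.package.SomeBooleanUdf();

logs = LOAD 'input/sample.log' AS (ip:chararray, url:chararray);

-- filter a relation by a boolean UDF applied to one of its fields
good = FILTER logs BY some_udf(logs.url);
```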

First, I got "ERROR 0: Scalar has more than one row in the output.":

cfgmc@phoebe:~/workspace-java/MscPigScripts/scripts (121) 23:11:16
scripts:> pig -x local
grunt> REGISTER
/home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
grunt>
grunt> -- Functions Definitions
grunt> DEFINE EdgeLoader msc.pig.EdgeLoader();
grunt> DEFINE valid msc.pig.IsValidUrl();
grunt> raw = LOAD '../inputTestes/wpc_sample.gz' using EdgeLoader;
grunt> Describe raw
raw: {ts: long,timeTaken: int,cIp: chararray,fSize: long,sIp:
chararray,sPort: int,scStatus: chararray,scBytes: long,csMethod:
chararray,url: chararray,rsDuration: int,rsBytes: int,referrer:
chararray,ua: chararray,edgeId: chararray}
grunt> B = FOREACH raw GENERATE cIp,url ;
grunt> describe B;
B: {cIp: chararray,url: chararray}
grunt> *C = FILTER B BY valid(B.url);*
grunt> describe C;
C: {cIp: chararray,url: chararray}
grunt> D = GROUP C BY B.cIp;
grunt> describe D;
D: {group: chararray,C: {cIp: chararray,url: chararray}}
grunt> urls_ok = FOREACH D GENERATE COUNT(C.url);
grunt> describe urls_ok;
urls_ok: {long}
grunt> dump urls_ok;


org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output.
1st : (187.113.41.93, http://webcast.sambatech.com.br/000482/account/8/3/ed92827f3e722bfbbabf89aa4adb0068/ER7_FA_3009_CARRASCONANYDIF_470kbps_2010-09-30.mp4),
2nd : (186.213.248.23, http://webcast.sambatech.com.br/000482/account/8/3/thumbnail/media/ea41d211f4e277821cb3e9fd392a51cf/R7_CH_TINAROMA_EMAILR7FAZENDA_470kbps_2010-09-140.03426408348605037.jpg)
    at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:325)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:169)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:212)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:289)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:148)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:127)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:240)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

Then, when I tried the same FILTER inside a nested FOREACH, I got:

grunt> REGISTER
/home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
grunt> DEFINE EdgeLoader msc.pig.EdgeLoader();
grunt> DEFINE valid msc.pig.IsValidUrl();
grunt> raw = LOAD '../inputTestes/wpc_sample.gz' using EdgeLoader;
grunt> B = FOREACH raw GENERATE cIp, sIp, sPort, scStatus, csMethod,
scBytes, url ;
grunt> describe B;
B: {cIp: chararray,sIp: chararray,sPort: int,scStatus: chararray,csMethod:
chararray,scBytes: long,url: chararray}
grunt> E = GROUP B ALL ;
grunt> describe E;
E: {group: chararray,B: {cIp: chararray,sIp: chararray,sPort: int,scStatus:
chararray,csMethod: chararray,scBytes: long,url: chararray}}

grunt> edge_breakdown = FOREACH E {
>> dist_cIps = DISTINCT B.cIp;
>> dist_sIps = DISTINCT B.sIp;
>> *urls_ok = FILTER B BY valid(B.url);*
>> GENERATE COUNT(dist_cIps),COUNT(dist_sIps) ,COUNT(urls_ok.url),
COUNT(B.url), SUM(B.scBytes);
>> }
grunt> DESC

DESC       DESCRIBE
grunt> DESCRIBE edge_breakdown;
2011-02-10 23:36:35,274 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-02-10 23:36:35,301 [main] ERROR org.apache.pig.impl.plan.OperatorPlan - Attempt to connect operator urls_ok: Filter 1-196 which is not in the plan.
2011-02-10 23:36:35,302 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2219: Unable to process scalar in the plan
Details at logfile: /home/speed/cfgmc/workspace-java/MscPigScripts/scripts/pig_1297388063472.log
grunt>

The log file says:

Pig Stack Trace
---------------
ERROR 2219: Unable to process scalar in the plan

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias edge_breakdown
    at org.apache.pig.PigServer.dumpSchema(PigServer.java:653)
    at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:236)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:315)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
    at org.apache.pig.Main.run(Main.java:465)
    at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2219: Unable to process scalar in the plan
    at org.apache.pig.PigServer.mergeScalars(PigServer.java:1299)
    at org.apache.pig.PigServer.compileLp(PigServer.java:1304)
    at org.apache.pig.PigServer.compileLp(PigServer.java:1241)
    at org.apache.pig.PigServer.dumpSchema(PigServer.java:639)
    ... 7 more
Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to connect operator urls_ok: Filter 1-196 which is not in the plan.
    at org.apache.pig.impl.plan.OperatorPlan.checkInPlan(OperatorPlan.java:409)
    at org.apache.pig.impl.plan.OperatorPlan.createSoftLink(OperatorPlan.java:210)
    at org.apache.pig.PigServer.mergeScalars(PigServer.java:1294)
    ... 10 more
================================================================================

If I run that last script without the FILTER inside the nested FOREACH, it
works perfectly, and the UDF itself works fine in other contexts.
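For reference, this is the variant that does work: the same breakdown, just without the nested FILTER (reassembled from the session above):

```pig
REGISTER /home/speed/cfgmc/workspace-java/MscPigScripts/jar/MscPigUtils.jar
DEFINE EdgeLoader msc.pig.EdgeLoader();

raw = LOAD '../inputTestes/wpc_sample.gz' USING EdgeLoader;
B = FOREACH raw GENERATE cIp, sIp, sPort, scStatus, csMethod, scBytes, url;
E = GROUP B ALL;

-- identical to edge_breakdown above, minus the urls_ok FILTER:
edge_breakdown = FOREACH E {
    dist_cIps = DISTINCT B.cIp;
    dist_sIps = DISTINCT B.sIp;
    GENERATE COUNT(dist_cIps), COUNT(dist_sIps), COUNT(B.url), SUM(B.scBytes);
};
```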



Seriously, what am I missing here?
I've been stuck on this all day!


-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840
