Hi,

I'm using Pig 0.9.2 on CDH3u3 with a snapshot build of Elephant Bird in
order to get JSON parsing. I'm seeing a very strange error with certain
gzip-compressed files. It's probably easiest to show you a Pig session:

grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
grunt> register '/home/joe/json-simple-1.1.jar';
grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING TextLoader() AS (line: chararray);
grunt> X = FOREACH apiHits GENERATE line, com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) AS json;
grunt> Y = LIMIT X 2;
grunt> dump Y;
(This succeeds, and I get the output I expect.)

Now, if I filter on the json field, I get the following:

grunt> A = FILTER X BY
>>   json#'logtype' == 'foo'
>>   OR json#'consumer' == 'foo1'
>>   OR json#'consumer' == 'foo2'
>>   OR json#'consumer' == 'foo3'
>>   OR json#'consumer' == 'foo4'
>>   ;
grunt> B = LIMIT A 2;
grunt> dump B;

ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long cannot be cast to org.json.simple.JSONObject

And in the task tracker logs, the stack trace suggests that the JSON UDF
is seeing still-compressed data [1]. Does anyone have ideas on how to
debug this, or a guess at what the problem is? Can I somehow determine
whether Hadoop is actually decompressing the data or not?
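One way I could imagine checking that last point: gzip streams always begin with the magic bytes 0x1f 0x8b (RFC 1952), so if the lines reaching the UDF start with those bytes, Hadoop handed the mapper raw compressed data instead of decompressing it. A minimal sketch of such a check (plain Java; `GzipCheck` and `looksGzipped` are made-up names, not part of elephant-bird or Hadoop):

```java
// Hypothetical helper (not part of elephant-bird): detect whether a byte
// buffer starts with the gzip magic number.
public class GzipCheck {
    // gzip streams always begin with bytes 0x1f 0x8b (RFC 1952).
    static boolean looksGzipped(byte[] head) {
        return head != null
                && head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }
}
```

You could log the result of this check on `line.getBytes()` from a throwaway UDF, or run it over a local copy of the part file. Comparing `hadoop fs -text` (which decompresses recognized codecs) against `hadoop fs -cat` (which does not) on the same file is another quick sanity check.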

Thanks!
Joe

[1]

2012-04-05 14:39:20,211 WARN com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode string: ����
Unexpected character () at position 0.
        at org.json.simple.parser.Yylex.yylex(Unknown Source)
        at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
        at org.json.simple.parser.JSONParser.parse(Unknown Source)
        at org.json.simple.parser.JSONParser.parse(Unknown Source)
        at org.json.simple.parser.JSONParser.parse(Unknown Source)
        at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
        at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
        at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)
