Hi,
I'm using Pig 0.9.2 on CDH3u3 with a snapshot build of Elephant Bird in
order to get JSON parsing. I'm seeing a very unusual error with certain
gzip-compressed files. It's probably easiest to show you a Pig session:
grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
grunt> register '/home/joe/json-simple-1.1.jar';
grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING TextLoader() AS (line: chararray);
grunt> X = FOREACH apiHits GENERATE line, com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) AS json;
grunt> Y = LIMIT X 2;
grunt> dump Y;
(This succeeds, and I get what I expect.)
Now, if I add a FILTER that uses the json field, I get the following:
grunt> A = FILTER X BY
>> json#'logtype' == 'foo'
>> OR json#'consumer' == 'foo1'
>> OR json#'consumer' == 'foo2'
>> OR json#'consumer' == 'foo3'
>> OR json#'consumer' == 'foo4'
>> ;
grunt> B = LIMIT A 2;
grunt> dump B;
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long cannot be cast to org.json.simple.JSONObject
And in the TaskTracker logs, the stack trace suggests that the JSON UDF is
seeing compressed data [1]. Does anyone have ideas on how to debug this, or
guesses as to what the problem is? Can I somehow determine whether Hadoop
is actually decompressing the data or not?
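For reference, one sanity check I can think of: a gzip stream always begins with the magic number 0x1f 0x8b (per RFC 1952), so if the bytes reaching the UDF still start with that, they were never decompressed. A quick sketch outside of Pig (plain Python, with a made-up sample line, not my actual data):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def looks_gzipped(data: bytes) -> bool:
    """Return True if the byte string still carries a gzip header."""
    return data.startswith(GZIP_MAGIC)

# Demo: compress a JSON-ish line and compare raw vs. compressed bytes.
raw = b'{"logtype": "foo"}'
compressed = gzip.compress(raw)

print(looks_gzipped(compressed))  # True  -> still compressed
print(looks_gzipped(raw))         # False -> plain text, as expected
```

The same check could be done against the string the UDF receives (or against `hadoop fs -cat ... | head -c 2` on the file itself) to tell the two cases apart.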
Thanks!
Joe
[1]
2012-04-05 14:39:20,211 WARN com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode string: ����
Unexpected character () at position 0.
	at org.json.simple.parser.Yylex.yylex(Unknown Source)
	at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
	at org.json.simple.parser.JSONParser.parse(Unknown Source)
	at org.json.simple.parser.JSONParser.parse(Unknown Source)
	at org.json.simple.parser.JSONParser.parse(Unknown Source)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
	at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)