So it turns out our uncompressed data contains corrupted rows. Is there an
easy way to wrap the JsonStringToMap UDF so it catches exceptions on
unparsable lines and just skips them?
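
One sketch of a workaround, assuming (as the WARN line in [1] suggests) that
JsonStringToMap logs a ParseException and returns null rather than failing
the task: drop the rows where the decode produced no map before doing any
projection. The `clean` alias name is just illustrative.

```pig
-- Decode each line, then keep only rows where JsonStringToMap
-- produced a map (it appears to return null for unparsable input).
apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING TextLoader() AS (line: chararray);
X = FOREACH apiHits GENERATE line,
    com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) AS json;
clean = FILTER X BY json IS NOT NULL;
```

Caveat: this only helps for exceptions the UDF itself swallows. The
ClassCastException in [1] escapes the UDF, so skipping those rows would need
a thin wrapper UDF whose exec() wraps the JsonStringToMap call in a
try/catch and returns null on any exception. And to answer the earlier
question about whether Hadoop is decompressing at all, comparing the output
of `hadoop fs -text` (which decodes gzip) against `hadoop fs -cat` (which
does not) on the same part file may help.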

On Thu, Apr 5, 2012 at 11:44 AM, Joe Crobak <[email protected]> wrote:

> Hi,
>
> I'm using pig 0.9.2 on cdh3u3 with a snapshot-build of elephant bird in
> order to get json parsing. I have an incredibly unusual error that I see
> with certain gzip compressed files. It's probably easiest to show you a pig
> session:
>
> grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
> grunt> register '/home/joe/json-simple-1.1.jar';
> grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING
> TextLoader() as (line: chararray);
> grunt> X = FOREACH apiHits GENERATE line,
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
> grunt> Y = LIMIT X 2;
> grunt> dump Y;
> (succeeds, and I get what I expect).
>
> Now, if I try to do a projection using the json field, I get the following:
>
> grunt> A = FILTER X BY
> >>   json#'logtype' == 'foo'
> >>   OR json#'consumer' == 'foo1'
> >>   OR json#'consumer' == 'foo2'
> >>   OR json#'consumer' == 'foo3'
> >>   OR json#'consumer' == 'foo4'
> >>   ;
> grunt> B = LIMIT A 2;
> grunt> dump B;
>
> ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long
> cannot be cast to org.json.simple.JSONObject
>
> And in the task tracker logs, the stack trace suggests that the json udf
> is seeing compressed data [1]. Does anyone have any ideas how to debug
> this, or guesses to what the problem is? Can I somehow determine if hadoop
> is actually decompressing the data or not?
>
> Thanks!
> Joe
>
> [1]
>
> 2012-04-05 14:39:20,211 WARN 
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode 
> string:  � ���
> Unexpected character ( ) at position 0.
>       at org.json.simple.parser.Yylex.yylex(Unknown Source)
>       at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
>       at org.json.simple.parser.JSONParser.parse(Unknown Source)
>       at org.json.simple.parser.JSONParser.parse(Unknown Source)
>       at org.json.simple.parser.JSONParser.parse(Unknown Source)
>       at 
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
>       at 
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
>       at 
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>       at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
>
>
