So it turns out our uncompressed data contains corrupted rows. Is there an easy way to wrap the JsonStringToMap UDF to catch exceptions on unparsable lines and just skip them?
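For what it's worth, judging by the log in the original message, elephant-bird's JsonStringToMap already catches ParseException and warns (that's the "Could not json-decode string" WARN); the crash here is a ClassCastException when the parsed top-level value isn't a JSONObject, which escapes that catch. So a wrapper would want to catch Exception broadly and return null, letting you drop bad rows with `FILTER X BY json IS NOT NULL`. Below is a rough, self-contained sketch of that catch-and-skip pattern — the class name is made up, and `jsonToMap` is a crude stand-in for the real org.json.simple parse so the example runs without elephant-bird on the classpath; a real version would extend the actual UDF and call its parse inside the try block.

```java
import java.util.HashMap;
import java.util.Map;

class SafeJsonStringToMap {

    // Returns null instead of throwing on unparsable lines, so downstream
    // Pig can drop bad rows with: good = FILTER X BY json IS NOT NULL;
    // Catches Exception (not just ParseException) so RuntimeExceptions
    // like the ClassCastException from a non-object top-level value are
    // skipped too.
    static Map<String, String> safeParse(String line) {
        try {
            return jsonToMap(line);
        } catch (Exception e) {
            // A real Pig UDF would call warn(...) here instead of
            // swallowing silently.
            return null;
        }
    }

    // Stand-in parser: accepts only flat {"k":"v",...} objects and throws
    // on anything else, mimicking a parse failure.
    static Map<String, String> jsonToMap(String line) {
        String s = line.trim();
        if (!s.startsWith("{") || !s.endsWith("}")) {
            throw new IllegalArgumentException("not a JSON object: " + line);
        }
        Map<String, String> m = new HashMap<>();
        String body = s.substring(1, s.length() - 1).trim();
        if (body.isEmpty()) return m;
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("bad pair: " + pair);
            }
            m.put(strip(kv[0]), strip(kv[1]));
        }
        return m;
    }

    // Removes surrounding double quotes, throwing on unquoted tokens.
    private static String strip(String s) {
        String t = s.trim();
        if (t.length() < 2 || t.charAt(0) != '"' || t.charAt(t.length() - 1) != '"') {
            throw new IllegalArgumentException("unquoted token: " + s);
        }
        return t.substring(1, t.length() - 1);
    }
}
```

To plug this into Pig you'd wrap it as an EvalFunc<Map> in a jar, register it in grunt like the elephant-bird jar, and filter out the nulls after the FOREACH.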
On Thu, Apr 5, 2012 at 11:44 AM, Joe Crobak <[email protected]> wrote:
> Hi,
>
> I'm using pig 0.9.2 on cdh3u3 with a snapshot build of elephant bird in
> order to get json parsing. I have an incredibly unusual error that I see
> with certain gzip compressed files. It's probably easiest to show you a
> pig session:
>
> grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
> grunt> register '/home/joe/json-simple-1.1.jar';
> grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING
> TextLoader() as (line: chararray);
> grunt> X = FOREACH apiHits GENERATE line,
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
> grunt> Y = LIMIT X 2;
> grunt> dump Y;
> (succeeds, and I get what I expect).
>
> Now, if I try to do a projection using the json field, I get the following:
>
> grunt> A = FILTER X BY
> >> json#'logtype' == 'foo'
> >> OR json#'consumer' == 'foo1'
> >> OR json#'consumer' == 'foo2'
> >> OR json#'consumer' == 'foo3'
> >> OR json#'consumer' == 'foo4'
> >> ;
> grunt> B = LIMIT A 2;
> grunt> dump B;
>
> ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long
> cannot be cast to org.json.simple.JSONObject
>
> And in the task tracker logs, the stack trace suggests that the json udf
> is seeing compressed data [1]. Does anyone have any ideas how to debug
> this, or guesses to what the problem is? Can I somehow determine if hadoop
> is actually decompressing the data or not?
>
> Thanks!
> Joe
>
> [1]
>
> 2012-04-05 14:39:20,211 WARN
> com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode
> string: � ���
> Unexpected character ( ) at position 0.
>     at org.json.simple.parser.Yylex.yylex(Unknown Source)
>     at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
>     at org.json.simple.parser.JSONParser.parse(Unknown Source)
>     at org.json.simple.parser.JSONParser.parse(Unknown Source)
>     at org.json.simple.parser.JSONParser.parse(Unknown Source)
>     at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
>     at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
>     at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>     at org.apache.hadoop.mapred.Child.main(Child.java:264)
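On the "can I determine if hadoop is actually decompressing" question from the quoted mail: every gzip member starts with the magic bytes 1f 8b (RFC 1952), so one sanity check is to look at the first bytes the reader is handing to the UDF — compare `hadoop fs -cat` (raw bytes) against `hadoop fs -text` (decompresses known codecs) on the same part file. A small self-contained illustration of the byte check (class and method names are mine, not a hadoop API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipMagicCheck {

    // A gzip stream always begins with the magic bytes 0x1f 0x8b
    // (RFC 1952). If a record reader is handing these bytes to the
    // UDF, the data was never decompressed.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
            && (data[0] & 0xff) == 0x1f
            && (data[1] & 0xff) == 0x8b;
    }

    // Helper to produce gzipped bytes for the demo, standing in for
    // the contents of part-r-00000.gz.
    static byte[] gzip(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

If `hadoop fs -text` shows readable JSON but the UDF's warning shows binary garbage like the log above, the codec is fine and the problem is inside the job, not the file.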
