Joe, we'd be happy to take a pull request that addresses this cast exception and maybe increments a counter.
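For reference, the invited fix could look something like the sketch below. It is only a sketch under stated assumptions: `parseAny` is a hypothetical stand-in for json-simple's `JSONParser.parse` (which returns `Object`, i.e. a `Long` for a line like `42` but a `JSONObject` for a line shaped like `{...}`), and the plain `badRows` counter stands in for whatever Hadoop/Pig counter an actual patch would increment.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed fix: because the parser can return a non-object type,
// the UDF should check the runtime type instead of blindly casting to
// JSONObject (the blind cast is what produces the ClassCastException).
// On a bad row it counts the skip and returns null so Pig can filter it out.
class JsonGuard {
    // Stand-in for a Hadoop counter the real patch would increment.
    static final AtomicLong badRows = new AtomicLong();

    // Hypothetical stand-in for org.json.simple.parser.JSONParser.parse():
    // returns a Long for a bare number, a Map for an object-shaped line.
    static Object parseAny(String line) {
        String s = line.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            return new LinkedHashMap<String, String>();
        }
        return Long.parseLong(s);
    }

    // Mirrors the shape of parseStringToMap(): null instead of a cast failure.
    static Map<?, ?> parseStringToMap(String line) {
        Object value = parseAny(line);
        if (!(value instanceof Map)) {
            badRows.incrementAndGet(); // record the corrupt/non-object row
            return null;               // caller filters out nulls, as with parse errors
        }
        return (Map<?, ?>) value;
    }
}
```

Returning null keeps the new behavior consistent with how the UDF already handles `ParseException` (see the catch clause discussed below), so downstream scripts only need one null-filtering idiom.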
On Mon, Apr 9, 2012 at 2:27 PM, Joe Crobak <[email protected]> wrote:

> Hi Norbert,
>
> In some cases, I actually get a ClassCastException, which I guess is the
> eventual cause of the job failures:
>
> java.lang.ClassCastException: java.lang.Long cannot be cast to org.json.simple.JSONObject
>         at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:52)
>         at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:42)
>         at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:22)
>
> (Note that I switched back to the 2.1.11 tag, so the stack trace corresponds to
> https://github.com/kevinweil/elephant-bird/blob/b300849f6d014aaac520e385a34aa37adb53b5fa/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java)
>
> I've put together a dummy heuristic to skip lines that don't match
> ^\\{.*\\}$, and this seems to get me past the CCE.
>
> Thanks for the info, though; I clearly missed the logging that you pointed out.
>
> Joe
>
> On Mon, Apr 9, 2012 at 4:36 PM, Norbert Burger <[email protected]> wrote:
>
>> So in this case, it seems like JsonStringToMap is properly catching the
>> parse exception; in fact, it's the catch clause of the UDF that's
>> generating the "Could not json-decode string" message in your task
>> tracker logs.
>>
>> Take a look at line 63 here:
>> https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
>>
>> When a parse exception happens, the UDF returns a null. Are you
>> filtering out nulls before trying to project?
>>
>> Norbert
>>
>> On Mon, Apr 9, 2012 at 3:41 PM, Joe Crobak <[email protected]> wrote:
>>
>>> So it turns out our uncompressed data contains corrupted rows. Is there
>>> an easy way to wrap the JsonStringToMap UDF so that it catches
>>> exceptions on unparsable lines and just skips them?
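Joe's "dummy heuristic" above amounts to a pre-filter on each input line before it ever reaches the parser. A minimal plain-Java sketch of that check (class and method names here are made up for illustration; the regex is the one from the message):

```java
import java.util.regex.Pattern;

// Sketch of the skip-lines heuristic: only lines that look like a JSON object
// ({...}) are handed to the parser; everything else is rejected up front.
// This sidesteps the CCE because json-simple only yields a JSONObject (rather
// than, say, a Long) when the top-level value is an object.
class LooksLikeJsonObject {
    private static final Pattern JSON_OBJECT_LINE = Pattern.compile("^\\{.*\\}$");

    static boolean accept(String line) {
        return line != null && JSON_OBJECT_LINE.matcher(line).matches();
    }
}
```

In the Pig script itself, the same effect could presumably be had without a custom class, either by filtering before the UDF (e.g. `FILTER apiHits BY line MATCHES '^\\{.*\\}$'`) or, per Norbert's suggestion, by dropping nulls after it (`FILTER X BY json IS NOT NULL`).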
>>>
>>> On Thu, Apr 5, 2012 at 11:44 AM, Joe Crobak <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using Pig 0.9.2 on CDH3u3 with a snapshot build of elephant-bird
>>>> in order to get JSON parsing. I have an incredibly unusual error that
>>>> I see with certain gzip-compressed files. It's probably easiest to
>>>> show you a Pig session:
>>>>
>>>> grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
>>>> grunt> register '/home/joe/json-simple-1.1.jar';
>>>> grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING TextLoader() as (line: chararray);
>>>> grunt> X = FOREACH apiHits GENERATE line, com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
>>>> grunt> Y = LIMIT X 2;
>>>> grunt> dump Y;
>>>>
>>>> (succeeds, and I get what I expect).
>>>>
>>>> Now, if I try to do a projection using the json field, I get the following:
>>>>
>>>> grunt> A = FILTER X BY
>>>> >> json#'logtype' == 'foo'
>>>> >> OR json#'consumer' == 'foo1'
>>>> >> OR json#'consumer' == 'foo2'
>>>> >> OR json#'consumer' == 'foo3'
>>>> >> OR json#'consumer' == 'foo4'
>>>> >> ;
>>>> grunt> B = LIMIT A 2;
>>>> grunt> dump B;
>>>>
>>>> ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long cannot be cast to org.json.simple.JSONObject
>>>>
>>>> And in the task tracker logs, the stack trace suggests that the JSON
>>>> UDF is seeing compressed data [1]. Does anyone have any ideas how to
>>>> debug this, or guesses as to what the problem is? Can I somehow
>>>> determine whether Hadoop is actually decompressing the data or not?
>>>>
>>>> Thanks!
>>>> Joe
>>>>
>>>> [1]
>>>>
>>>> 2012-04-05 14:39:20,211 WARN com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode string: � ���
>>>> Unexpected character ( ) at position 0.
>>>>         at org.json.simple.parser.Yylex.yylex(Unknown Source)
>>>>         at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
>>>>         at org.json.simple.parser.JSONParser.parse(Unknown Source)
>>>>         at org.json.simple.parser.JSONParser.parse(Unknown Source)
>>>>         at org.json.simple.parser.JSONParser.parse(Unknown Source)
>>>>         at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
>>>>         at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
>>>>         at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>>>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
