Hi Norbert,

In some cases I actually get a ClassCastException, which I guess is the
eventual cause of the job failures:

java.lang.ClassCastException: java.lang.Long cannot be cast to org.json.simple.JSONObject
        at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:52)
        at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:42)
        at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:22)

(Note that I switched back to the 2.1.11 tag, so the stack trace
corresponds to
https://github.com/kevinweil/elephant-bird/blob/b300849f6d014aaac520e385a34aa37adb53b5fa/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java)
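For context on why the cast fails: my understanding is that JSONParser.parse
returns whatever top-level JSON value it finds, so a corrupted line that
happens to read as a bare number comes back as a java.lang.Long rather than a
JSONObject, and the unchecked cast in parseStringToMap throws. A quick Python
analogue of the same behaviour (Python's json standard library standing in for
json-simple):

```python
import json

# A JSON parser accepts any top-level JSON value, not just objects.
# A corrupted line that happens to look like a number still parses;
# the result is just a number, not a map. This mirrors json-simple
# handing back a java.lang.Long where a JSONObject was expected.
doc = json.loads('{"logtype": "foo"}')
num = json.loads('42')

assert isinstance(doc, dict)  # the expected case
assert isinstance(num, int)   # the surprise that triggers the cast failure
```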

I've put together a dummy heuristic to skip lines that don't match
^\\{.*\\}$, and this seems to get me past the CCE.
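In case it's useful to anyone else, the heuristic amounts to this (sketched
here in Python; the names and sample lines are mine):

```python
import re

# Dummy heuristic from above: keep only lines that start with '{' and
# end with '}'. This is not real JSON validation; it just skips the
# corrupted rows that made the UDF choke.
JSON_OBJECT_RE = re.compile(r'^\{.*\}$')

def looks_like_json_object(line):
    return JSON_OBJECT_RE.match(line) is not None

lines = ['{"logtype": "foo"}', '42', 'binary garbage', '{"consumer": "foo1"}']
kept = [l for l in lines if looks_like_json_object(l)]
# kept now holds only the two object-shaped lines
```

In Pig itself the equivalent pre-filter would be something like
clean = FILTER apiHits BY line MATCHES '\\{.*\\}'; applied before calling the
UDF; since Pig's MATCHES compares against the whole string, the explicit ^ and
$ aren't needed there.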

Thanks for the info, though; I clearly missed the logging that you pointed
out.

Joe



On Mon, Apr 9, 2012 at 4:36 PM, Norbert Burger <[email protected]> wrote:

> So in this case, it seems like JsonStringToMap is properly catching the
> parse exception; in fact, it's the catch clause of the UDF that's
> generating the "Could not json-decode string" message in your task tracker
> logs.
>
> Take a look at line 63 here:
>
> https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
>
> When a parse exception happens, the UDF returns a null.  Are you filtering
> out nulls before trying to project?
>
> Norbert
>
> On Mon, Apr 9, 2012 at 3:41 PM, Joe Crobak <[email protected]> wrote:
>
> > So it turns out our uncompressed data contains corrupted rows. Is there an
> > easy way to wrap the JsonStringToMap UDF to catch exceptions on unparsable
> > lines and just skip them?
> >
> > On Thu, Apr 5, 2012 at 11:44 AM, Joe Crobak <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm using pig 0.9.2 on cdh3u3 with a snapshot build of elephant-bird in
> > > order to get json parsing. I see an incredibly unusual error with
> > > certain gzip-compressed files. It's probably easiest to show you a pig
> > > session:
> > >
> > > grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
> > > grunt> register '/home/joe/json-simple-1.1.jar';
> > > grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING
> > > TextLoader() as (line: chararray);
> > > grunt> X = FOREACH apiHits GENERATE line,
> > > com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
> > > grunt> Y = LIMIT X 2;
> > > grunt> dump Y;
> > > (succeeds, and I get what I expect).
> > >
> > > Now, if I try to do a projection using the json field, I get the
> > > following:
> > >
> > > grunt> A = FILTER X BY
> > > >>   json#'logtype' == 'foo'
> > > >>   OR json#'consumer' == 'foo1'
> > > >>   OR json#'consumer' == 'foo2'
> > > >>   OR json#'consumer' == 'foo3'
> > > >>   OR json#'consumer' == 'foo4'
> > > >>   ;
> > > grunt> B = LIMIT A 2;
> > > grunt> dump B;
> > >
> > > ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR:
> > > java.lang.Long cannot be cast to org.json.simple.JSONObject
> > >
> > > And in the task tracker logs, the stack trace suggests that the json udf
> > > is seeing compressed data [1]. Does anyone have any ideas how to debug
> > > this, or guesses as to what the problem is? Can I somehow determine
> > > whether hadoop is actually decompressing the data or not?
> > >
> > > Thanks!
> > > Joe
> > >
> > > [1]
> > >
> > > 2012-04-05 14:39:20,211 WARN com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode string:  � ���
> > > Unexpected character ( ) at position 0.
> > >       at org.json.simple.parser.Yylex.yylex(Unknown Source)
> > >       at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
> > >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> > >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> > >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> > >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
> > >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
> > >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> > >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> > >       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
> > >       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> > >       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> > >       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > >       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> > >       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> > >       at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> > >       at java.security.AccessController.doPrivileged(Native Method)
> > >       at javax.security.auth.Subject.doAs(Subject.java:396)
> > >       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> > >       at org.apache.hadoop.mapred.Child.main(Child.java:264)
> > >
> > >
> > >
> >
>
