OK, I think I have an explanation.

"1. is it legitimate to load a Pair schema, or should I be loading a schema 
that just consists of my generic record?"

- Yes, loading with a Pair schema works for me.

"2. why is Avro unable to load data it produced itself, seeing as 
AvroStorage.getSchema reads out the same schema from my input avro data file as 
the basis for parsing the input?"

The problem was that I had created my avro schema by reflection from a thrift 
entity, and in doing so had set the "avro.java.string" property on the string 
fields of the schema to "String" instead of leaving them as Utf8 (although I 
was writing Utf8). When AvroStorage reads this data, the 
PigAvroDatumReader.read method calls GenericDatumReader.readString, which 
decides - on the basis of the property mentioned above - whether to return a 
java.lang.String (the Decoder reads a Utf8 but returns its .toString() result, 
causing a cast exception when PigAvroDatumReader attempts to cast it back to 
Utf8) or the Utf8 value itself.

I.e. it seems from the explicit cast in PigAvroDatumReader.readString that 
only Utf8 values in string fields can be processed by AvroStorage.
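
To illustrate the mechanism, here is a minimal sketch that uses only the Java 
standard library. It is a simplified model of what I believe is happening, not 
the real Avro or Pig code: StringCastSketch, FakeUtf8 and all the method names 
are hypothetical stand-ins, and the stringProp argument stands in for the 
value of the "avro.java.string" property.

```java
// Simplified, stdlib-only model of the failure. None of these classes are the
// real Avro or Pig classes; every name here is hypothetical.
class StringCastSketch {

    // Stand-in for org.apache.avro.util.Utf8: a CharSequence that is not a String.
    static final class FakeUtf8 implements CharSequence {
        private final String value;
        FakeUtf8(String value) { this.value = value; }
        @Override public int length() { return value.length(); }
        @Override public char charAt(int index) { return value.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return new FakeUtf8(value.substring(start, end));
        }
        @Override public String toString() { return value; }
    }

    // Stand-in for the base reader: the schema's string property decides
    // which Java type is returned for a string field.
    static Object readString(String stringProp, String decoded) {
        if ("String".equals(stringProp)) {
            return decoded;            // plain java.lang.String
        }
        return new FakeUtf8(decoded);  // Utf8-like object
    }

    // Stand-in for a subclass that assumes every string field is Utf8.
    // Throws ClassCastException when the base reader returned a String.
    static FakeUtf8 readStringUnchecked(String stringProp, String decoded) {
        return (FakeUtf8) readString(stringProp, decoded);
    }

    // A more tolerant variant: accept any CharSequence and convert it.
    static String readStringSafely(String stringProp, String decoded) {
        Object value = readString(stringProp, decoded);
        return ((CharSequence) value).toString();
    }

    // Returns true when the unchecked cast fails for the given property value.
    static boolean castFails(String stringProp) {
        try {
            readStringUnchecked(stringProp, "hello");
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castFails("String"));
        System.out.println(castFails("Utf8"));
        System.out.println(readStringSafely("String", "hello"));
    }
}
```

In this model the unchecked cast fails only when the property is "String", 
which matches the exception I saw. Casting to CharSequence instead would 
accept both representations.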

Andrew



>________________________________
> From: Andrew Kenworthy <[email protected]>
>To: "[email protected]" <[email protected]> 
>Sent: Wednesday, November 16, 2011 3:25 PM
>Subject: Difficulties loading avro-generated data with AvroStorage
> 
>Hello,
>
>I'm a little confused as to how to load avro data into pig using AvroStorage. 
>I have a map-reduce job that writes an AvroKey<Long>/AvroValue<GenericRecord> 
>K/V pair, producing a schema that looks like this:
>
>{ "fields" : [ { "doc" : "",
>        "name" : "key",
>        "type" : "long"
>      },
>      { "doc" : "",
>        "name" : "value",
>        "order" : "ignore",
>        "type" : { "fields" : [ { "name" : "logid",
>                  "type" : "long"
>                },
>                { "name" : "my_data",
>                  "type" : [ "null",
>                      { "avro.java.string" : "String",
>                        "type" : "map",
>                        "values" : [ "null",
>                            { "avro.java.string" : "String",
>                              "type" : "string"
>                            }
>                          ]
>                      }
>                    ]
>                },
>                {...} etc. etc. ],
>            "name" : "my_log",
>            "namespace" : "x.y.z.log.avro",
>            "type" : "record"
>          }
>      }
>    ],
>  "name" : "Pair",
>  "namespace" : "org.apache.avro.mapred",
>  "type" : "record"
>}
>
>i.e. a Pair schema including an avro.java.string field. When I load a datafile 
>with this schema using AvroStorage, I get the following exception:
>
>java.io.IOException: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
>    at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:251)
>    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
>    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
>    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
>    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readString(PigAvroDatumReader.java:154)
>    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
>    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
>    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
>    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
>    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
>    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:135)
>    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
>    at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
>    at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:249)
>
>which seems to be because my avro string value cannot be cast to an avro Utf8 
>object.
>
>My questions are:
>
>1. is it legitimate to load a Pair schema, or should I be loading a schema 
>that just consists of my generic record?
>2. why is Avro unable to load data it produced itself, seeing as 
>AvroStorage.getSchema reads out the same schema from my input avro data file 
>as the basis for parsing the input?
>
>(I'm sorry I can't be more specific... it's difficult to debug this and so I'm 
>guessing at the cause).
>
>regards,
>Andrew Kenworthy
>
>
