Hello,

I'm a little confused about how to load Avro data into Pig using AvroStorage. I 
have a map-reduce job that writes an AvroKey<Long>/AvroValue<GenericRecord> K/V 
pair, producing a schema that looks like this:

{ "fields" : [ { "doc" : "",
        "name" : "key",
        "type" : "long"
      },
      { "doc" : "",
        "name" : "value",
        "order" : "ignore",
        "type" : { "fields" : [ { "name" : "logid",
                  "type" : "long"
                },
                { "name" : "my_data",
                  "type" : [ "null",
                      { "avro.java.string" : "String",
                        "type" : "map",
                        "values" : [ "null",
                            { "avro.java.string" : "String",
                              "type" : "string"
                            }
                          ]
                      }
                    ]
                },
                { ... } etc. etc. ],
            "name" : "my_log",
            "namespace" : "x.y.z.log.avro",
            "type" : "record"
          }
      }
    ],
  "name" : "Pair",
  "namespace" : "org.apache.avro.mapred",
  "type" : "record"
}

i.e. a Pair schema including an avro.java.string field. When I load a data file 
with this schema using AvroStorage, I get the following exception:

java.io.IOException:
java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.avro.util.Utf8
    at 
org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:251)
    at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
    at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
    at
org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.avro.util.Utf8
    at 
org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readString(PigAvroDatumReader.java:154)
    at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
    at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at 
org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
    at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at
org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
    at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:135)
    at
org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
    at
org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
    at
org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:249)
which seems to be because my Avro string value cannot be cast to an Avro Utf8 
object.
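
For reference, the load is essentially just pointing AvroStorage at the job's 
output directory, something like this (the jar path, file path, and alias below 
are illustrative, not my real ones):

```pig
-- register piggybank so AvroStorage is available
REGISTER piggybank.jar;

-- load the Avro output of the map-reduce job; AvroStorage reads the
-- schema embedded in the data file itself, so no schema is passed here
data = LOAD '/path/to/mr/output' USING
       org.apache.pig.piggybank.storage.avro.AvroStorage();

DUMP data;
```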
My questions are:
1. Is it legitimate to load a Pair schema, or should I be loading a schema 
that consists of just my generic record?
2. Why is Avro unable to load data it produced itself, seeing as 
AvroStorage.getSchema reads out the same schema from my input Avro data file as 
the basis for parsing the input?
(I'm sorry I can't be more specific; this is difficult to debug, so I'm 
guessing at the cause.)
Regards,
Andrew Kenworthy