Hello,
I'm a little confused about how to load Avro data into Pig using AvroStorage. I
have a map-reduce job that writes an AvroKey<Long>/AvroValue<GenericRecord>
key/value pair, producing a schema that looks like this:
{ "fields" : [ { "doc" : "",
                 "name" : "key",
                 "type" : "long"
               },
               { "doc" : "",
                 "name" : "value",
                 "order" : "ignore",
                 "type" : { "fields" : [ { "name" : "logid",
                                           "type" : "long"
                                         },
                                         { "name" : "my_data",
                                           "type" : [ "null",
                                                      { "avro.java.string" : "String",
                                                        "type" : "map",
                                                        "values" : [ "null",
                                                                     { "avro.java.string" : "String",
                                                                       "type" : "string"
                                                                     }
                                                                   ]
                                                      }
                                                    ]
                                         },
                                         {...} etc. etc. ],
                            "name" : "my_log",
                            "namespace" : "x.y.z.log.avro",
                            "type" : "record"
                          }
               }
             ],
  "name" : "Pair",
  "namespace" : "org.apache.avro.mapred",
  "type" : "record"
}
i.e. a Pair schema whose value record contains fields marked with
avro.java.string. When I load a data file with this schema using AvroStorage, I
get the following exception:
java.io.IOException: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
    at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:251)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:455)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8
    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readString(PigAvroDatumReader.java:154)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:135)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
    at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
    at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:249)
which seems to be because my Avro string value, deserialized as a
java.lang.String, cannot be cast to Avro's Utf8 class.
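To illustrate what I think is happening, here is a plain-JDK sketch (the Utf8 class below is just a stand-in for org.apache.avro.util.Utf8, not the real thing, and readString is only my guess at what the Pig reader effectively does):

```java
// Sketch of my theory: with "avro.java.string" : "String" in the schema, the
// datum reader materialises strings as java.lang.String, but the Pig reader
// appears to cast unconditionally to Avro's Utf8 representation.
public class CastMismatchDemo {

    // Stand-in for org.apache.avro.util.Utf8 (NOT the real class).
    static class Utf8 {
        final String value;
        Utf8(String value) { this.value = value; }
    }

    // Roughly what PigAvroDatumReader.readString seems to do: assume the
    // default Avro in-memory string type without checking the actual class.
    static String readString(Object datum) {
        return ((Utf8) datum).value; // ClassCastException if datum is a String
    }

    public static void main(String[] args) {
        Object defaultDatum = new Utf8("hello"); // default GenericData reader
        Object stringDatum  = "hello";           // reader honouring avro.java.string

        System.out.println(readString(defaultDatum)); // works fine
        try {
            readString(stringDatum);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the trace above");
        }
    }
}
```

So the data itself looks fine; the two sides just disagree about the in-memory representation of a string.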
My questions are:
1. is it legitimate to load a Pair schema, or should I be loading a schema
that consists of just my generic record?
2. why can't AvroStorage read back data that Avro itself produced, given that
AvroStorage.getSchema reads the very same schema out of my input Avro data file
and uses it as the basis for parsing the input?
(I'm sorry I can't be more specific; this is difficult to debug, so I'm
guessing at the cause.)
regards,
Andrew Kenworthy