Thanks in advance for any help anyone can offer with this issue. Forgive the scattered presentation; I am new to Avro and am unsure how to phrase the exact issue I'm experiencing.
I have been using Java to read Avro messages off of a Kafka topic in a heterogeneous environment (the messages are put onto the topic by a Go client written by a different team; I have no control over what is put onto the wire or how).
My team controls the schema for the project; we have a .avsc schema file and use the "com.commercehub.gradle.plugin.avro" Gradle plugin to generate our Java classes. Currently, I can deserialize messages into our auto-generated classes effectively. However, the Utf8 data type used in lieu of String is a constant vexation: we use String quite often, CharSequence rarely, and Utf8 never. Hence, our code is cluttered with toString() calls on Utf8 objects, which makes navigating and understanding the source very cumbersome.
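To make the clutter concrete, here is a contrived sketch (DomainObject and getSomeField() are stand-ins for our real generated class and accessor, not the actual names):

import org.apache.avro.util.Utf8;

public class Utf8ClutterExample {
    // Stand-in for a generated class: the accessor returns CharSequence,
    // which is a Utf8 instance at runtime after deserialization.
    static class DomainObject {
        private final CharSequence someField = new Utf8("hello");
        public CharSequence getSomeField() { return someField; }
    }

    public static void main(String[] args) {
        DomainObject obj = new DomainObject();
        // Every consumer must convert before using ordinary String APIs:
        String value = obj.getSomeField().toString();
        System.out.println(value.startsWith("he"));
    }
}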
The Gradle plugin in question has a "stringType" option which, when set to String, does indeed generate the domain classes with standard Strings instead of Utf8 objects. However, when I attempt to deserialize generic messages off the topic, I receive the following stack trace (we are using the Avro 1.7.7 Java library):
java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String
    at com.employer.avro.schema.DomainObject.put(DomainObject.java:64)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:573)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:590)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
I am using a method similar to the following to read byte arrays that come off the topic:
byte[] message = ReadTheQueueHere.....
SpecificDatumReader<SomeType> specificDatumReader = new SpecificDatumReader<>(ourSchema);
BinaryDecoder someTypeDecoder = DecoderFactory.get().binaryDecoder(message, null);
SomeType record = specificDatumReader.read(null, someTypeDecoder); // <-- Exception thrown here
Other things I have tried:
I tried setting the "avro.java.string": "String" property on all String fields within the schema. This did nothing; the problem still occurs. In fact, the exception above occurs on a field that is explicitly marked with "avro.java.string": "String" in the schema file:
Java:

public void put(int field$, java.lang.Object value$) {
    ...
    case 3: some_field = (java.lang.String)value$; break;
    ...
}

Schema file:

..{ "name": "some_field", "type": "string", "avro.java.string": "String" },..
In fact, when I toString() the schema, it shows up there too:
{"name":"some_field","type":{"type":"string","avro.java.string":"String"},"avro.java.string":"String"}
I tried using GenericData.setStringType(DomainObject.getClassSchema(), StringType.String); this also did nothing (but see below).
I tried setting the String type on the schema as a whole:

Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
GenericData.setStringType(schema, StringType.String);

This resulted in the following exception:

org.apache.avro.AvroRuntimeException: Can't set properties on a union: [Our Schema as a String]

I suppose this is because of some subtlety in our schema arising from our inexperience with the platform. From what little poking I've done, our entire schema (with several record types within it) is defined as a union, a construct I have found very little explanation of.
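Since setStringType() refuses to operate on the union directly, one workaround I have been considering (a sketch only, not verified against our real schema) is applying it to each branch of the union individually:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.StringType;

// Sketch: a top-level union exposes its branches via getTypes(), and the
// branches themselves are not unions, so setStringType() no longer throws.
Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
if (schema.getType() == Schema.Type.UNION) {
    for (Schema branch : schema.getTypes()) {
        GenericData.setStringType(branch, StringType.String);
    }
}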
I worked up a small sample method to reproduce what I think is happening. I
believe that the Go end of things is writing generic records to Kafka (I don't
even know if this makes a difference).
public static SomeType encodeToSpecificRecord(GenericRecord someRecord, Schema writerSchema, Schema readerSchema) throws Exception {
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(writerSchema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    datumWriter.write(someRecord, encoder);
    encoder.flush();
    byte[] inBytes = out.toByteArray();
    out.close();

    SpecificDatumReader<SomeType> reader = new SpecificDatumReader<>(readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(inBytes, null);
    SomeType result = reader.read(null, decoder);
    logger.debug("Read record {}", result.toString());
    return result;
}
The only way that I can get this to work is if I first call

GenericData.setStringType(SomeType.getClassSchema(), StringType.String);

and then call the method with SomeType.getClassSchema() as both the reader and writer schema.
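Concretely, the only invocation that succeeds looks like this (someRecord here being a GenericRecord built against that same class schema):

GenericData.setStringType(SomeType.getClassSchema(), StringType.String);
SomeType result = encodeToSpecificRecord(someRecord, SomeType.getClassSchema(), SomeType.getClassSchema());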
This will not do, because I am fairly sure this is not how the Go end of things is writing the data; as far as I know, they use the overarching schema as read from the file. If I mismatch the reader and writer schemas, I get the following:
java.lang.ArrayIndexOutOfBoundsException: 9
    at org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:257)
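My understanding (possibly wrong, and I may simply be misusing it) is that mismatched schemas are meant to be reconciled by handing the reader both schemas, along these lines:

// Sketch: SpecificDatumReader's two-argument constructor takes the writer's
// schema and the reader's schema and resolves between them, instead of
// assuming the two are identical.
SpecificDatumReader<SomeType> reader = new SpecificDatumReader<>(writerSchema, SomeType.getClassSchema());
Decoder decoder = DecoderFactory.get().binaryDecoder(inBytes, null);
SomeType result = reader.read(null, decoder);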
Can anyone offer any assistance in detangling this problem? Constant conversions between String and Utf8 clutter the code, reduce efficiency, and hurt readability and maintainability.
Thank you very much!