Thanks in advance for any help anyone can offer with this issue. Forgive the scattered presentation; I am new to Avro and am unsure how to phrase the exact issue I'm experiencing.
I have been using Java to read Avro messages off of a Kafka topic in a heterogeneous environment (the messages are put onto the topic by a Go client written by a different team; I have no control over what is put onto the wire or how).
My team controls the schema for the project; we have a .avsc schema file and use the "com.commercehub.gradle.plugin.avro" Gradle plugin to generate our Java classes. Currently, I can deserialize messages into our auto-generated classes effectively. However, the Utf8 data type used in lieu of String is a constant vexation: we use String quite often, CharSequence rarely, and Utf8 never. Hence, our code is cluttered with toString() calls on Utf8 objects, which makes navigating and understanding the source very cumbersome.
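To make the clutter concrete, here is a contrived sketch (DomainObject and getSomeField() are stand-ins for our real generated class and accessor, not the actual names):

import org.apache.avro.util.Utf8;

public class Utf8ClutterExample {
    // Stand-in for a generated class: the accessor returns CharSequence,
    // which is a Utf8 instance at runtime after deserialization.
    static class DomainObject {
        private final CharSequence someField = new Utf8("hello");
        public CharSequence getSomeField() { return someField; }
    }

    public static void main(String[] args) {
        DomainObject obj = new DomainObject();
        // Every consumer must convert before using ordinary String APIs:
        String value = obj.getSomeField().toString();
        System.out.println(value.startsWith("he"));
    }
}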
The Gradle plugin in question has a "stringType" option which, when set to String, does indeed generate the domain classes with standard Strings instead of Utf8 objects. However, when I attempt to deserialize generic messages off the topic, I receive the following stack trace (we are using the Avro 1.7.7 Java library):
java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String
    at com.employer.avro.schema.DomainObject.put(DomainObject.java:64)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:573)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:590)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
I am using a method similar to the following to read byte arrays that come off the topic:
byte[] message = ReadTheQueueHere.....
SpecificDatumReader<SomeType> specificDatumReader = new SpecificDatumReader<>(ourSchema);
BinaryDecoder someTypeDecoder = DecoderFactory.get().binaryDecoder(message, null);
SomeType record = specificDatumReader.read(null, someTypeDecoder); // <-- Exception thrown here
Other things I have tried:
I tried setting the "avro.java.string": "String" property on all String fields within the schema. This did nothing; the problem still occurs. In fact, the exception above occurs on a field that is explicitly marked with "avro.java.string": "String" in the schema file:
Java:

public void put(int field$, java.lang.Object value$) {
    ...
    case 3: some_field = (java.lang.String)value$; break;
    ...
}

Schema file:

..{ "name": "some_field", "type": "string", "avro.java.string": "String" },..
In fact, when I toString() the schema, it shows up there too:
{"name":"some_field","type":{"type":"string","avro.java.string":"String"},"avro.java.string":"String"}
I tried using GenericData.setStringType(DomainObject.getClassSchema(), StringType.String); this also did nothing (but see below).
I tried setting the String type on the schema as a whole:

Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
GenericData.setStringType(schema, StringType.String);

This resulted in the following exception:

org.apache.avro.AvroRuntimeException: Can't set properties on a union: [Our Schema as a String]

I suppose this is because of some subtlety in our schema arising from our inexperience with the platform. From what little poking I've done, our entire schema (with several record types within it) is defined as a union, a construct I have found very little explanation of.
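Since setStringType() refuses to operate on the union directly, one workaround I have been considering (a sketch only, not verified against our real schema) is applying it to each branch of the union individually:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.StringType;

// Sketch: a top-level union exposes its branches via getTypes(), and the
// branches themselves are not unions, so setStringType() no longer throws.
Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
if (schema.getType() == Schema.Type.UNION) {
    for (Schema branch : schema.getTypes()) {
        GenericData.setStringType(branch, StringType.String);
    }
}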
I worked up a small sample method to reproduce what I think is happening. I
believe that the Go end of things is writing generic records to Kafka (I don't
even know if this makes a difference).
public static SomeType encodeToSpecificRecord(GenericRecord someRecord, Schema writerSchema, Schema readerSchema) throws Exception {
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(writerSchema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    datumWriter.write(someRecord, encoder);
    encoder.flush();
    byte[] inBytes = out.toByteArray();
    out.close();

    SpecificDatumReader<SomeType> reader = new SpecificDatumReader<>(readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(inBytes, null);
    SomeType result = reader.read(null, decoder);
    logger.debug("Read record {}", result.toString());
    return result;
}
The only way that I can get this to work is if I first call

GenericData.setStringType(SomeType.getClassSchema(), StringType.String);

and then call the method with SomeType.getClassSchema() as both the reader and writer schema.
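Concretely, the only invocation that succeeds looks like this (someRecord here being a GenericRecord built against that same class schema):

GenericData.setStringType(SomeType.getClassSchema(), StringType.String);
SomeType result = encodeToSpecificRecord(someRecord, SomeType.getClassSchema(), SomeType.getClassSchema());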
This will not do, because I am fairly sure this is not how the Go end of things is writing the data; as far as I know, they use the overarching schema as read from the file. If I mismatch the reader and writer schemas, I get the following:
java.lang.ArrayIndexOutOfBoundsException: 9
    at org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:257)
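My understanding (possibly wrong, and I may simply be misusing it) is that mismatched schemas are meant to be reconciled by handing the reader both schemas, along these lines:

// Sketch: SpecificDatumReader's two-argument constructor takes the writer's
// schema and the reader's schema and resolves between them, instead of
// assuming the two are identical.
SpecificDatumReader<SomeType> reader = new SpecificDatumReader<>(writerSchema, SomeType.getClassSchema());
Decoder decoder = DecoderFactory.get().binaryDecoder(inBytes, null);
SomeType result = reader.read(null, decoder);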
Can anyone offer any assistance in detangling this problem? Constant conversions between String and Utf8 clutter the code, reduce efficiency, and hurt readability and maintainability.
Thank you very much!