Josh, I use the createUnion method of class org.apache.avro.Schema.

The mixed message types then have the union as their common type and are thus 
homogeneous.
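
As a rough illustration of the approach, here is a minimal sketch using the generic Avro Java API. The record types Order and Payment, the namespace "example", and the class and method names are hypothetical and are not taken from this thread. It assumes every message type has a distinct record name, since a union cannot contain two records with the same full name.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionContainerSketch {
    // Hypothetical message schemas, standing in for the real ones.
    static final Schema ORDER = SchemaBuilder.record("Order").namespace("example")
            .fields().requiredLong("id").endRecord();
    static final Schema PAYMENT = SchemaBuilder.record("Payment").namespace("example")
            .fields().requiredDouble("amount").endRecord();

    // The union becomes the single schema of the container file.
    static final Schema UNION = Schema.createUnion(Arrays.asList(ORDER, PAYMENT));

    // Writes mixed records, in the given order, into one container file held in memory.
    static byte[] write(List<GenericRecord> records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<Object> writer =
                     new DataFileWriter<>(new GenericDatumWriter<Object>(UNION))) {
            writer.create(UNION, out);
            for (GenericRecord record : records) {
                writer.append(record); // the writer picks the matching union branch
            }
        }
        return out.toByteArray();
    }

    // Reads the file back and returns the record type name of each entry, in file order.
    static List<String> readTypeNames(byte[] file) throws IOException {
        List<String> names = new ArrayList<>();
        try (DataFileStream<GenericRecord> reader = new DataFileStream<>(
                new ByteArrayInputStream(file), new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                names.add(record.getSchema().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        GenericRecord order = new GenericData.Record(ORDER);
        order.put("id", 1L);
        GenericRecord payment = new GenericData.Record(PAYMENT);
        payment.put("amount", 9.5);
        byte[] file = write(Arrays.asList(order, payment, order));
        System.out.println(readTypeNames(file));
    }
}
```

Each entry is read back as a record of its own type, in the order it was written, so the messages can be dispatched on record.getSchema() when they are replayed.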

Yours sincerely,
Ken Jarrad.

From: Josh [mailto:[email protected]]
Sent: 15 November 2016 10:24
To: [email protected]
Subject: Alternative to Avro container files for long-term Avro storage

Hi all,

I am using a typical Avro->Kafka solution where data is serialized to Avro 
before it gets written to Kafka and each message is prepended with a schema ID 
which can be looked up in my schema repository.

Now, I want to store the data in long-term storage by writing data from 
Kafka->S3.

I know that the usual way to store Avro data long term is in Avro container 
files. However, a container file can only contain messages encoded with a 
single Avro schema. In my case, the messages may be encoded with different 
schemas, and I need to retain the order of the messages (so that they can be 
replayed into Kafka, in order). A single file in S3 therefore needs to contain 
messages encoded with different schemas, so I can't use Avro container files.

What would be a good solution to this? What format could I use 
to store my Avro data, such that a single data file can contain messages 
encoded with different schemas? Should I store the messages with a prepended 
schema ID, similar to what I do in Kafka? In that case, how could I separate 
the messages in the file?

Thanks for any advice,
Josh
