Josh,

My understanding of Avro container files is that they include the schema (they are self-contained) and the reader gets the schema from the container itself. I use this technique for Kafka, not for Avro containers, so I avoid the problem of 'sealing' the schema inside the container, but I do need to publish the schema for use by others.

Appending a new type of message probably requires duplicating an existing container. Avro unions, however, are backward compatible when a new type is appended, which allows my Kafka clients to read older messages with newer unions.

-Ken.
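A minimal sketch of the self-describing behaviour Ken describes, using the standard Avro Java API (the Event record and the events.avro file name are invented for illustration): the writer embeds the schema in the container's header, and the reader recovers it from the file itself rather than from any registry.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class ContainerDemo {
        public static void main(String[] args) throws Exception {
            Schema schema = SchemaBuilder.record("Event").fields()
                    .requiredString("id").endRecord();
            File file = new File("events.avro");

            // The writer seals the schema into the file header, so the
            // container is self-contained.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<GenericRecord>(
                         new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file);
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", "e-1");
                writer.append(rec);
            }

            // The reader obtains the schema from the container itself;
            // no external schema lookup is needed.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<GenericRecord>(
                         file, new GenericDatumReader<GenericRecord>())) {
                System.out.println("schema from file: " + reader.getSchema());
                for (GenericRecord rec : reader) {
                    System.out.println(rec);
                }
            }
        }
    }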
From: Josh [mailto:[email protected]]
Sent: 15 November 2016 12:46
To: [email protected]
Subject: Re: Alternative to Avro container files for long-term Avro storage

Hi Ken,

Thanks for the reply - that does sound like a good idea. However, I don't think it will work well for me, as I don't have a fixed number of message types. In my case, new message types could potentially be added every day, and the union could grow to contain hundreds of message types. It also sounds tricky to manage the union when adding new message types (i.e. making sure readers' schemas are updated first).

If there's a nice way to do it, I'd like to find an approach that doesn't involve Avro container files, so that I can maintain a separate Avro schema per message type.

Josh

On Tue, Nov 15, 2016 at 12:21 PM, Jarrad, Ken <[email protected]> wrote:

Josh,

I use the method createUnion on the class org.apache.avro.Schema. The mixed message types then have the union as their common type and are thus homogeneous.

Yours sincerely,
Ken Jarrad.

From: Josh [mailto:[email protected]]
Sent: 15 November 2016 10:24
To: [email protected]
Subject: Alternative to Avro container files for long-term Avro storage

Hi all,

I am using a typical Avro->Kafka setup, where data is serialized to Avro before it is written to Kafka, and each message is prepended with a schema ID that can be looked up in my schema repository. Now I want to put the data into long-term storage by writing it from Kafka->S3.

I know that the usual way to store Avro is in Avro container files, but a container file can only contain messages encoded with a single Avro schema. In my case, the messages may be encoded with different schemas, and I need to retain the order of the messages (so that they can be replayed into Kafka, in order). Therefore, a single file in S3 needs to contain messages encoded with different schemas, so I can't use Avro container files.

I was wondering what would be a good solution to this? What format could I use to store my Avro data, such that a single data file can contain messages encoded with different schemas? Should I store the messages with a prepended schema ID, similar to what I do in Kafka? In that case, how could I separate the messages in the file?

Thanks for any advice,
Josh
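A minimal sketch of the createUnion approach Ken describes above, using the standard org.apache.avro.Schema API (the OrderCreated and OrderShipped record schemas are invented for illustration):

    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class UnionDemo {
        public static void main(String[] args) {
            // Two example message types; in practice these would be the
            // real per-message schemas.
            Schema orderCreated = SchemaBuilder.record("OrderCreated").fields()
                    .requiredString("orderId").endRecord();
            Schema orderShipped = SchemaBuilder.record("OrderShipped").fields()
                    .requiredString("orderId").requiredString("carrier").endRecord();

            // The union becomes the single common schema for all message
            // types, so a stream of mixed messages is homogeneous from
            // Avro's point of view.
            Schema union = Schema.createUnion(Arrays.asList(orderCreated, orderShipped));
            System.out.println(union);
        }
    }

Each datum is then written as one branch of the union, and appending a new branch is the backward-compatible change Ken refers to: readers holding the newer union can still decode data written with the older one.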

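One possible answer to Josh's closing questions, sketched under assumptions rather than as a definitive format: prefix each message with a schema ID (4 bytes here, mirroring the Kafka convention Josh mentions) and a payload length, so that a single file can hold Avro payloads written with different schemas while preserving their order. The FramedAvroFile and Framed names below are hypothetical.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.IOException;

    // Hypothetical framing: each message is stored as
    // [4-byte schema ID][4-byte payload length][Avro binary payload].
    public final class FramedAvroFile {

        // Append one framed message to the stream.
        public static void writeMessage(DataOutputStream out, int schemaId,
                                        byte[] avroPayload) throws IOException {
            out.writeInt(schemaId);           // which schema in the repository
            out.writeInt(avroPayload.length); // length lets the reader split messages
            out.write(avroPayload);
        }

        // Read the next framed message, or return null at a clean end of file.
        public static Framed readMessage(DataInputStream in) throws IOException {
            int schemaId;
            try {
                schemaId = in.readInt();
            } catch (EOFException e) {
                return null; // no more messages
            }
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);
            return new Framed(schemaId, payload);
        }

        public static final class Framed {
            public final int schemaId;
            public final byte[] payload;

            Framed(int schemaId, byte[] payload) {
                this.schemaId = schemaId;
                this.payload = payload;
            }
        }
    }

On replay, each frame's schema ID would be looked up in the schema repository to recover the writer schema, and the length prefix is what separates one message from the next, so no per-schema delimiter is needed and the original ordering is preserved.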