Thanks for the replies. Originally I wanted to have a Kafka topic with multiple schema types, but Ken's approach sounds like it could work well, so I will try out the single-schema approach with a big union type at the root of the schema.
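
Here's a rough sketch of what I have in mind, using the Java GenericRecord API. The EventA/EventB record types are just placeholders for my real schemas, but the point is that with the union as the container file's schema, a single file can hold records of either type, in order:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class UnionRootExample {
    // Placeholder schemas: EventA and EventB stand in for the real message types.
    private static final String UNION_SCHEMA_JSON =
        "[" +
        " {\"type\": \"record\", \"name\": \"EventA\", \"fields\": [" +
        "   {\"name\": \"id\", \"type\": \"long\"}]}," +
        " {\"type\": \"record\", \"name\": \"EventB\", \"fields\": [" +
        "   {\"name\": \"message\", \"type\": \"string\"}]}" +
        "]";

    public static void main(String[] args) throws IOException {
        Schema union = new Schema.Parser().parse(UNION_SCHEMA_JSON);
        Schema eventA = union.getTypes().get(0);
        Schema eventB = union.getTypes().get(1);

        GenericRecord a = new GenericData.Record(eventA);
        a.put("id", 1L);
        GenericRecord b = new GenericData.Record(eventB);
        b.put("message", "hello");

        // The container file is created with the union schema, so records of
        // both branches can be appended to the same file, preserving order.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(union))) {
            writer.create(union, new File("events.avro"));
            writer.append(a);
            writer.append(b);
        }
    }
}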
Josh

On Tue, Nov 15, 2016 at 4:44 PM, John McClean <[email protected]> wrote:

> One approach is to have separate Kafka topics per schema, which evolve with
> use of a schema registry: https://github.com/confluentinc/schema-registry.
> You'd write to the topic with the schema id in metadata. You'd write normal
> Avro storage files, knowing when to split them based on the changing schema
> id in the Kafka message.
>
> On Tue, Nov 15, 2016 at 2:24 AM, Josh <[email protected]> wrote:
>
>> Hi all,
>>
>> I am using a typical Avro->Kafka setup where data is serialized to Avro
>> before it gets written to Kafka, and each message is prepended with a
>> schema ID which can be looked up in my schema repository.
>>
>> Now I want to put the data into long-term storage by writing it from
>> Kafka to S3.
>>
>> I know that the usual way to store Avro data is in Avro container files;
>> however, a container file can only contain messages encoded with a single
>> Avro schema. In my case, the messages may be encoded with different
>> schemas, and I need to retain the order of the messages (so that they can
>> be replayed into Kafka, in order). Therefore, a single file in S3 needs to
>> contain messages encoded with different schemas, so I can't use Avro
>> container files.
>>
>> I was wondering what would be a good solution to this? What format could I
>> use to store my Avro data, such that a single data file can contain
>> messages encoded with different schemas? Should I store the messages with
>> a prepended schema ID, similar to what I do in Kafka? In that case, how
>> could I separate the messages in the file?
>>
>> Thanks for any advice,
>> Josh
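
For completeness, here is a rough sketch of the splitting approach John describes, in case I end up needing it: roll to a new container file whenever the schema id changes, so each file stays single-schema while the overall ordering is preserved. The Message class and the lookupSchema() call are just placeholders for my Kafka consumer record and my schema repository lookup:

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SchemaSplitWriter {

    // Placeholder for a message read from Kafka: schema id plus decoded record.
    static class Message {
        final int schemaId;
        final GenericRecord record;
        Message(int schemaId, GenericRecord record) {
            this.schemaId = schemaId;
            this.record = record;
        }
    }

    // Placeholder for a real lookup against the schema repository.
    static Schema lookupSchema(int schemaId) {
        throw new UnsupportedOperationException("wire up to your schema repository");
    }

    // Writes messages in order, starting a new single-schema container file
    // each time the schema id changes.
    static void writeSplitFiles(List<Message> orderedMessages) throws IOException {
        DataFileWriter<GenericRecord> writer = null;
        int currentSchemaId = -1;
        int fileIndex = 0;
        try {
            for (Message m : orderedMessages) {
                if (writer == null || m.schemaId != currentSchemaId) {
                    if (writer != null) {
                        writer.close();
                    }
                    Schema schema = lookupSchema(m.schemaId);
                    writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
                    writer.create(schema, new File("part-" + (fileIndex++) + ".avro"));
                    currentSchemaId = m.schemaId;
                }
                writer.append(m.record);
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}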
