Josh,

My understanding of Avro container files is that they include the schema (they are self-contained) and the reader gets the schema from the container itself. I use this technique for Kafka, not for Avro containers, so I avoid the problem of 'sealing' the schema inside the container, but I do need to publish the schema for use by others.

Appending a new type of message probably requires duplicating an existing container. Avro unions, however, are backward compatible when a new type is appended, which allows my Kafka clients to read older messages with newer unions.

-Ken.
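A minimal sketch of the self-describing behaviour Ken describes, using the standard Avro Java API (the Event record and the events.avro file name are invented for illustration): the writer embeds the schema in the container's header, and the reader recovers it from the file itself rather than from any registry.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class ContainerDemo {
        public static void main(String[] args) throws Exception {
            Schema schema = SchemaBuilder.record("Event").fields()
                    .requiredString("id").endRecord();
            File file = new File("events.avro");

            // The writer seals the schema into the file header, so the
            // container is self-contained.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<GenericRecord>(
                         new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file);
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", "e-1");
                writer.append(rec);
            }

            // The reader obtains the schema from the container itself;
            // no external schema lookup is needed.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<GenericRecord>(
                         file, new GenericDatumReader<GenericRecord>())) {
                System.out.println("schema from file: " + reader.getSchema());
                for (GenericRecord rec : reader) {
                    System.out.println(rec);
                }
            }
        }
    }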
From: Josh [mailto:[email protected]]
Sent: 15 November 2016 12:46
To: [email protected]
Subject: Re: Alternative to Avro container files for long-term Avro storage

Hi Ken,

Thanks for the reply - that does sound like a good idea. However, I don't think it will work well for me, as I don't have a fixed number of message types. In my case, new message types could potentially be added every day, and the union could grow to contain hundreds of message types. It also sounds tricky to manage the union when adding new message types (i.e. making sure readers' schemas are updated first).

If there's a nice way to do it, I'd like to find an approach that doesn't involve Avro container files, so that I can maintain a separate Avro schema per message type.

Josh

On Tue, Nov 15, 2016 at 12:21 PM, Jarrad, Ken <[email protected]> wrote:

Josh,

I use the method createUnion on the class org.apache.avro.Schema. The mixed message types then have the union as their common type and are thus homogeneous.

Yours sincerely,
Ken Jarrad.

From: Josh [mailto:[email protected]]
Sent: 15 November 2016 10:24
To: [email protected]
Subject: Alternative to Avro container files for long-term Avro storage

Hi all,

I am using a typical Avro->Kafka setup, where data is serialized to Avro before it is written to Kafka, and each message is prepended with a schema ID that can be looked up in my schema repository. Now I want to put the data into long-term storage by writing it from Kafka->S3.

I know that the usual way to store Avro is in Avro container files, but a container file can only contain messages encoded with a single Avro schema. In my case, the messages may be encoded with different schemas, and I need to retain the order of the messages (so that they can be replayed into Kafka, in order). Therefore, a single file in S3 needs to contain messages encoded with different schemas, so I can't use Avro container files.

I was wondering what would be a good solution to this? What format could I use to store my Avro data, such that a single data file can contain messages encoded with different schemas? Should I store the messages with a prepended schema ID, similar to what I do in Kafka? In that case, how could I separate the messages in the file?

Thanks for any advice,
Josh
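A minimal sketch of the createUnion approach Ken describes above, using the standard org.apache.avro.Schema API (the OrderCreated and OrderShipped record schemas are invented for illustration):

    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class UnionDemo {
        public static void main(String[] args) {
            // Two example message types; in practice these would be the
            // real per-message schemas.
            Schema orderCreated = SchemaBuilder.record("OrderCreated").fields()
                    .requiredString("orderId").endRecord();
            Schema orderShipped = SchemaBuilder.record("OrderShipped").fields()
                    .requiredString("orderId").requiredString("carrier").endRecord();

            // The union becomes the single common schema for all message
            // types, so a stream of mixed messages is homogeneous from
            // Avro's point of view.
            Schema union = Schema.createUnion(Arrays.asList(orderCreated, orderShipped));
            System.out.println(union);
        }
    }

Each datum is then written as one branch of the union, and appending a new branch is the backward-compatible change Ken refers to: readers holding the newer union can still decode data written with the older one.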

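One possible answer to Josh's closing questions, sketched under assumptions rather than as a definitive format: prefix each message with a schema ID (4 bytes here, mirroring the Kafka convention Josh mentions) and a payload length, so that a single file can hold Avro payloads written with different schemas while preserving their order. The FramedAvroFile and Framed names below are hypothetical.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.IOException;

    // Hypothetical framing: each message is stored as
    // [4-byte schema ID][4-byte payload length][Avro binary payload].
    public final class FramedAvroFile {

        // Append one framed message to the stream.
        public static void writeMessage(DataOutputStream out, int schemaId,
                                        byte[] avroPayload) throws IOException {
            out.writeInt(schemaId);           // which schema in the repository
            out.writeInt(avroPayload.length); // length lets the reader split messages
            out.write(avroPayload);
        }

        // Read the next framed message, or return null at a clean end of file.
        public static Framed readMessage(DataInputStream in) throws IOException {
            int schemaId;
            try {
                schemaId = in.readInt();
            } catch (EOFException e) {
                return null; // no more messages
            }
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);
            return new Framed(schemaId, payload);
        }

        public static final class Framed {
            public final int schemaId;
            public final byte[] payload;

            Framed(int schemaId, byte[] payload) {
                this.schemaId = schemaId;
                this.payload = payload;
            }
        }
    }

On replay, each frame's schema ID would be looked up in the schema repository to recover the writer schema, and the length prefix is what separates one message from the next, so no per-schema delimiter is needed and the original ordering is preserved.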