Great response. From what I understand of your last response, you are 
actually sending a wrapped Avro message to Kafka, and all of your consumers 
know how to decode this wrapped message into two parts… a unique identifier 
(the writer schema's MD5 digest) and the actual Avro message. Is that 
correct? If so, that answers question #1. 



On Aug 20, 2013, at 11:01 AM, Eric Wasserman <[email protected]> wrote:

> You may want to check out this Avro feature request: 
> https://issues.apache.org/jira/browse/AVRO-1124
> which has a lot of nice motivation and usage patterns. Unfortunately, it's 
> not yet resolved.
> 
> There are really two broad use cases. 
> 
> 1) The data are "small" compared to the schema (perhaps because they form a 
> collection or stream of records encoded by different schemas).
> 2) The data are "big" compared to the schema (very big records, or lots of 
> records that share a schema).
> 
> Case (1) is often a candidate for a schema registry. Case (2), not as much.
> 
> Examples from my own usage:
> 
> For Kafka we include an MD5 digest of the writer's schema with each Message. 
> It is serialized as a concatenation of the fixed-length MD5 and the binary 
> Avro-encoded data. To decode we read off the MD5, look up the schema and use 
> it to decode the remainder of the Message.
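> 
> In Java, with the stock Avro library, that framing looks roughly like the 
> sketch below. The digest computation here is illustrative; all that matters 
> is that producers and the registry derive the fingerprint from the schema 
> the same way.
> 
>     import java.io.ByteArrayOutputStream;
>     import java.security.MessageDigest;
>     import org.apache.avro.Schema;
>     import org.apache.avro.generic.GenericDatumWriter;
>     import org.apache.avro.generic.GenericRecord;
>     import org.apache.avro.io.BinaryEncoder;
>     import org.apache.avro.io.EncoderFactory;
> 
>     // Wrap one record: the 16-byte MD5 of the writer's schema, followed
>     // by the binary Avro encoding of the record itself.
>     public static byte[] encodeWithFingerprint(GenericRecord record)
>             throws Exception {
>         Schema schema = record.getSchema();
>         byte[] md5 = MessageDigest.getInstance("MD5")
>                 .digest(schema.toString().getBytes("UTF-8"));
> 
>         ByteArrayOutputStream out = new ByteArrayOutputStream();
>         out.write(md5);  // fixed-length prefix, always 16 bytes
> 
>         BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
>         new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
>         encoder.flush();
>         return out.toByteArray();  // this becomes the Kafka Message payload
>     }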
> [You could also segregate data written with different schemas into different 
> Kafka topics. Knowing which topic a message is under, you can then arrange a 
> way to look up the writer's schema. That lets you avoid even the cost of 
> including the MD5 in the Messages.]
> 
> In either case, consumer code needs to look up the full schema from a 
> "registry" in order to actually decode the Avro-encoded data. The registry 
> serves the full schema that corresponds to the specified MD5 digest.
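> 
> Consumer-side, the decode is the mirror image. Another sketch, where the 
> SchemaRegistry interface below is a hypothetical stand-in for whatever 
> service (plus local cache) maps a digest to a full schema:
> 
>     import java.util.Arrays;
>     import org.apache.avro.Schema;
>     import org.apache.avro.generic.GenericDatumReader;
>     import org.apache.avro.generic.GenericRecord;
>     import org.apache.avro.io.BinaryDecoder;
>     import org.apache.avro.io.DecoderFactory;
> 
>     // Hypothetical registry lookup; not part of Avro itself.
>     interface SchemaRegistry {
>         Schema lookup(byte[] md5Digest);
>     }
> 
>     public static GenericRecord decode(byte[] payload, SchemaRegistry registry)
>             throws Exception {
>         // Read off the fixed-length fingerprint...
>         byte[] md5 = Arrays.copyOfRange(payload, 0, 16);
>         // ...ask the registry for the full writer's schema...
>         Schema writerSchema = registry.lookup(md5);
>         // ...and decode the remainder of the payload with it.
>         BinaryDecoder decoder = DecoderFactory.get()
>                 .binaryDecoder(payload, 16, payload.length - 16, null);
>         return new GenericDatumReader<GenericRecord>(writerSchema)
>                 .read(null, decoder);
>     }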
> 
> We use a similar technique for storing MD5-tagged Avro data in "columns" of 
> Cassandra and so on.
> 
> Case (2) is pretty well handled by just embedding the full schema itself.
> 
> For example, for Hadoop you can just use Avro data files, which include the 
> actual schema in a header. All the records in the file then adhere to that 
> same schema. In this case, using a registry to get the writer's schema is not 
> necessary.
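> 
> For completeness, the self-describing file case looks something like this 
> (standard Avro data file API; the file names are made up):
> 
>     import java.io.File;
>     import org.apache.avro.Schema;
>     import org.apache.avro.file.DataFileReader;
>     import org.apache.avro.file.DataFileWriter;
>     import org.apache.avro.generic.GenericDatumReader;
>     import org.apache.avro.generic.GenericDatumWriter;
>     import org.apache.avro.generic.GenericRecord;
> 
>     // Writing: the schema is stored once, in the file header.
>     Schema schema = new Schema.Parser().parse(new File("user.avsc"));
>     DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
>             new GenericDatumWriter<GenericRecord>(schema));
>     writer.create(schema, new File("users.avro"));
>     // writer.append(record); ... every appended record uses that schema
>     writer.close();
> 
>     // Reading: no registry involved; the reader recovers the writer's
>     // schema from the file header itself.
>     DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
>             new File("users.avro"), new GenericDatumReader<GenericRecord>());
>     Schema writerSchema = reader.getSchema();  // straight from the file
>     for (GenericRecord r : reader) { /* process r */ }
>     reader.close();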
> 
> Note: As described in the feature request linked above, some people use a 
> schema registry as a way of coordinating schema evolution rather than just as 
> a way of making schema access "economical".
> 
> 
> 
> On Aug 20, 2013, at 9:19 AM, Mark wrote:
> 
>> Can someone break down how message serialization would work with Avro and a 
>> schema registry? We are planning to use Avro with Kafka, and I've read that 
>> instead of adding a schema to every single event, it would be wise to add 
>> some sort of fingerprint to each message to identify which schema should be 
>> used. What I'm having trouble understanding is: how do we read the 
>> fingerprint without a schema? Don't we need the schema to deserialize? The 
>> same question goes for working with Hadoop: how does the input format know 
>> which schema to use?
>> 
>> Thanks
> 
> 
