Great response. From what I understand of your last response, you are actually sending a wrapped Avro message to Kafka, and all of your consumers know how to decode this wrapped message into two parts: a unique identifier (the MD5 digest of the writer's schema) and the actual Avro-encoded message. Is that correct? If so, that answers question #1.
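
In other words, something like this rough Java sketch of my understanding (the SchemaRegistry interface here is made up for illustration, and I'm just hashing the schema's JSON text to get the MD5):

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class WrappedAvroCodec {

    // Hypothetical registry: maps the 16-byte MD5 of a schema to the full Schema.
    public interface SchemaRegistry {
        Schema lookup(byte[] md5);
    }

    // Producer side: Message = [16-byte MD5 of writer schema][binary Avro payload]
    public static byte[] encode(GenericRecord record, Schema writerSchema) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(writerSchema.toString().getBytes("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(md5); // fixed-length prefix, so it can be read off without any schema
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Consumer side: read the fixed-length MD5 first, look up the full schema,
    // then decode the remainder of the Message with it.
    public static GenericRecord decode(byte[] message, SchemaRegistry registry) throws Exception {
        byte[] md5 = new byte[16];
        System.arraycopy(message, 0, md5, 0, 16);
        Schema writerSchema = registry.lookup(md5);
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(message, 16, message.length - 16, null);
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
    }
}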
On Aug 20, 2013, at 11:01 AM, Eric Wasserman <[email protected]> wrote:

> You may want to check out this Avro feature request:
> https://issues.apache.org/jira/browse/AVRO-1124
> which has a lot of nice motivation and usage patterns. Unfortunately, it's
> not yet resolved.
>
> There are really two broad use cases.
>
> 1) The data are "small" compared to the schema (perhaps because it's a
> collection or stream of records encoded by different schemas).
> 2) The data are "big" compared to the schema (very big records, or lots of
> records that share a schema).
>
> Case (1) is often a candidate for a schema registry. Case (2) not as much.
>
> Examples from my own usage:
>
> For Kafka we include an MD5 digest of the writer's schema with each Message.
> It is serialized as a concatenation of the fixed-length MD5 and the binary
> Avro-encoded data. To decode, we read off the MD5, look up the schema, and
> use it to decode the remainder of the Message.
> [You could also segregate data written with different schemas into different
> Kafka topics. By knowing which topic a message is under, you can then
> arrange a way to look up the writer's schema. That lets you avoid even the
> cost of including the MD5 in the Messages.]
>
> In either case, consumer code needs to look up the full schema from a
> "registry" in order to actually decode the Avro-encoded data. The registry
> serves the full schema that corresponds to the specified MD5 digest.
>
> We use a similar technique for storing MD5-tagged Avro data in "columns" of
> Cassandra, and so on.
>
> Case (2) is pretty well handled by just embedding the full schema itself.
>
> For example, for Hadoop you can just use Avro data files, which include the
> actual schema in a header. All the records in the file then adhere to that
> same schema. In this case, using a registry to get the writer's schema is
> not necessary.
>
> Note: As described in the feature request linked above, some people use a
> schema registry as a way of coordinating schema evolution rather than just
> as a way of making schema access "economical".
>
>
> On Aug 20, 2013, at 9:19 AM, Mark wrote:
>
>> Can someone break down how message serialization would work with Avro and
>> a schema registry? We are planning to use Avro with Kafka, and I've read
>> that instead of adding a schema to every single event it would be wise to
>> add some sort of fingerprint to each message to identify which schema
>> should be used. What I'm having trouble understanding is: how do we read
>> the fingerprint without a schema? Don't we need the schema to deserialize?
>> The same question goes for working with Hadoop: how does the input format
>> know which schema to use?
>>
>> Thanks
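
P.S. For the Hadoop case (2), if I understand correctly, no registry lookup is needed at all, because the reader recovers the writer's schema straight from the data file's header. Something like this (file name made up):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvroDataFile {
    public static void main(String[] args) throws Exception {
        // The writer's schema is embedded in the file header; no registry needed.
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File("events.avro"), new GenericDatumReader<GenericRecord>());
        Schema writerSchema = reader.getSchema(); // recovered from the header
        System.out.println("writer schema: " + writerSchema);
        for (GenericRecord record : reader) { // every record shares that schema
            System.out.println(record);
        }
        reader.close();
    }
}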
