I had the same problem a while ago, and for the same reasons you mention we decided to use fingerprints (an MD5 hash of the schema).
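For reference, here is roughly what computing such a fingerprint looks like with the Java Avro API, as a minimal sketch assuming Avro's SchemaNormalization class (the record schema is just a toy example):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;

    public class FingerprintExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
            // Reduce the schema to Avro's "parsing canonical form" and
            // hash that, so textually different but equivalent schemas
            // get the same fingerprint. See the caveat about
            // normalisation below.
            byte[] md5 = SchemaNormalization.parsingFingerprint("MD5", schema);
            System.out.println("fingerprint is " + md5.length + " bytes");
        }
    }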
There are some catches here, though.

First, I believe the normalisation of the schema is incomplete, so you might end up with different hashes for the same schema.

Second, a 128-bit integer prepended to both keys and values takes more space than a 32-bit one. Not a big issue for values, but for keys it doubles the size for us (a framing sketch follows the quoted mail below).

Third, we have already started using Confluent's registry as well, because of its existing integration with other pieces of infrastructure (Camus, Bottled Water, etc.). What would be useful from that perspective is a byte or two prepended to the schema id, defining the registry namespace.

I've added the fingerprint schema registry as an example in the C++ Kafka library at
https://github.com/bitbouncer/csi-kafka/tree/master/examples/schema-registry
We run a couple of those in a Mesos cluster and use HAProxy to find them.

/svante

2015-07-09 10:36 GMT+02:00 Daniel Schierbeck <[email protected]>:

> I'm working on a system that will store Avro-encoded messages in Kafka.
> The system will have both producers and consumers in different languages,
> including Ruby (not JRuby) and Java.
>
> At the moment I'm encoding each message as a data file, which means that
> the full schema is included in each encoded message. This is obviously
> suboptimal, but it doesn't seem like there's a standardized format for
> single-message Avro encodings.
>
> I've reviewed Confluent's schema-registry offering, but that seems to be
> overkill for my needs, and would require me to run and maintain yet
> another piece of infrastructure. Ideally, I wouldn't have to use anything
> besides Kafka.
>
> Is this something that other people have experience with?
>
> I've come up with a scheme that would seem to work well independently of
> what kind of infrastructure you're using: whenever a writer process is
> asked to encode a message m with schema s for the first time, it
> broadcasts (s', s) to a schema registry, where s' is the fingerprint of s.
> The schema registry in this case can be pluggable, and can be any
> mechanism that allows different processes to access the schemas. The
> writer then encodes the message as (s', m), i.e. only includes the schema
> fingerprint. A reader, when first encountering a message with a schema
> fingerprint s', looks up s from the schema registry and uses s to decode
> the message.
>
> Here, the concept of a schema registry has been abstracted away and is
> not tied to the concept of "schema ids" and versions. Furthermore, there
> are some desirable traits:
>
> 1. Schemas are identified by their fingerprints, so there's no need for
> an external system to issue schema ids.
> 2. Writing (s', s) pairs is idempotent, so there's no need to coordinate
> that task. If you've got a system with many writers, you can let all of
> them broadcast their schemas when they boot or when they need to encode
> data using the schemas.
> 3. It would work using a range of different backends for the schema
> registry. Simple key-value stores would obviously work, but for my case
> I'd probably want to use Kafka itself. If the schemas are written to a
> topic with key-based compaction, where s' is the message key and s is the
> message value, then Kafka would automatically clean up duplicates over
> time. This would save me from having to add more pieces to my
> infrastructure.
>
> Has this problem been solved already? If not, would it make sense to
> define a common "message format" that defines the structure of (s', m)
> pairs?
>
> Cheers,
> Daniel Schierbeck
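To make the (s', m) framing from the quoted mail concrete, together with the registry-namespace byte suggested above, here is a minimal Java sketch. The layout (one namespace byte, a 16-byte MD5 fingerprint, then the Avro binary body) is only a suggestion, not an established format:

    import java.nio.ByteBuffer;

    public class MessageFraming {
        // Hypothetical wire layout: [1-byte registry namespace]
        // [16-byte MD5 fingerprint][Avro binary body].
        public static byte[] frame(byte namespace, byte[] fingerprint,
                                   byte[] body) {
            ByteBuffer buf =
                ByteBuffer.allocate(1 + fingerprint.length + body.length);
            buf.put(namespace);
            buf.put(fingerprint);
            buf.put(body);
            return buf.array();
        }

        // Reader side: skip the namespace byte and read the fingerprint,
        // which is then used to look up the schema before decoding the
        // rest of the message.
        public static byte[] fingerprintOf(byte[] framed) {
            byte[] fp = new byte[16];
            System.arraycopy(framed, 1, fp, 0, fp.length);
            return fp;
        }
    }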

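And a sketch of point 3 from the quoted mail: broadcasting an (s', s) pair to a compacted Kafka topic with the Java producer. The topic name "_schemas" and the broker address are assumptions, and the topic has to be created with cleanup.policy=compact for Kafka to deduplicate by fingerprint key:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SchemaTopicWriter {
        // Broadcasts a (fingerprint, schema) pair to a compacted topic.
        // Key = fingerprint bytes, value = schema JSON, so compaction
        // eventually keeps one copy of each schema.
        public static void publish(byte[] fingerprint, String schemaJson) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<byte[], String> producer =
                     new KafkaProducer<>(props)) {
                producer.send(
                    new ProducerRecord<>("_schemas", fingerprint, schemaJson));
            }
        }
    }

Since writing (s', s) pairs is idempotent, every writer can publish its schemas unconditionally at boot without any coordination.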