I'm working on a system that will store Avro-encoded messages in Kafka. The system will have both producers and consumers in different languages, including Ruby (not JRuby) and Java.
At the moment I'm encoding each message as an Avro data file, which means that the full schema is included in every encoded message. This is obviously suboptimal, but there doesn't seem to be a standardized format for single-message Avro encodings. I've looked at Confluent's Schema Registry, but that seems like overkill for my needs and would require me to run and maintain yet another piece of infrastructure. Ideally, I wouldn't have to use anything besides Kafka. Is this something other people have experience with?

I've come up with a scheme that seems to work well independently of what kind of infrastructure you're using: whenever a writer process is asked to encode a message m with schema s for the first time, it broadcasts (s', s) to a schema registry, where s' is the fingerprint of s. The schema registry here is pluggable and can be any mechanism that lets different processes look up schemas. The writer then encodes the message as (s', m), i.e. it only includes the schema fingerprint. A reader, when it first encounters a message with schema fingerprint s', looks up s in the schema registry and uses s to decode the message. The concept of a schema registry is thus abstracted away and isn't tied to "schema ids" or versions. (There's a rough sketch of this in the P.S. below.)

Furthermore, there are some desirable traits:

1. Schemas are identified by their fingerprints, so there's no need for an external system to issue schema ids.

2. Writing (s', s) pairs is idempotent, so there's no need to coordinate that task. If you've got a system with many writers, you can let all of them broadcast their schemas when they boot or whenever they need to encode data with a given schema.

3. It would work with a range of different backends for the schema registry. Simple key-value stores would obviously work, but in my case I'd probably want to use Kafka itself: if the schemas are written to a topic with key-based compaction, where s' is the message key and s is the message value, Kafka will automatically clean up duplicates over time. This would save me from having to add more pieces to my infrastructure.

Has this problem been solved already? If not, would it make sense to define a common "message format" that defines the structure of (s', m) pairs?

Cheers,
Daniel Schierbeck
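P.S. To make the idea a bit more concrete, here's a rough sketch of the writer and reader sides in Java. The SchemaStore interface is just a stand-in for whatever pluggable registry backend you'd plug in, and the 8-byte CRC-64-AVRO fingerprint plus the simple framing of (s', m) are only one possible choice, not a proposal for the final format.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    // Stand-in for the pluggable schema registry: any mechanism that can
    // store and retrieve (fingerprint, schema) pairs will do.
    interface SchemaStore {
      void put(long fingerprint, Schema schema);  // idempotent
      Schema get(long fingerprint);
    }

    class FingerprintedCodec {
      private final SchemaStore store;

      FingerprintedCodec(SchemaStore store) {
        this.store = store;
      }

      // Writer side: broadcast (s', s) to the registry (cheap and idempotent),
      // then encode the message as (s', m) -- an 8-byte fingerprint followed
      // by the plain Avro binary encoding of the record.
      byte[] encode(GenericRecord record) throws IOException {
        Schema schema = record.getSchema();
        long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
        store.put(fingerprint, schema);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(8).putLong(fingerprint).array());
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
      }

      // Reader side: strip off s', look up s in the registry, and use s as
      // the writer's schema when decoding m.
      GenericRecord decode(byte[] message) throws IOException {
        long fingerprint = ByteBuffer.wrap(message, 0, 8).getLong();
        Schema writerSchema = store.get(fingerprint);

        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(message, 8, message.length - 8, null);
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
      }
    }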

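And for the Kafka-backed variant, the registry could just be a producer writing to a log-compacted "schemas" topic keyed by fingerprint, plus a consumer that tails that topic into an in-memory map (the consumer side is omitted below). Again only a rough sketch: it assumes the SchemaStore interface from the previous sketch, a placeholder topic name, byte-array serializers in the producer config, and a topic created with cleanup.policy=compact so duplicate (s', s) pairs get cleaned up over time.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.avro.Schema;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Kafka-backed SchemaStore: (s', s) pairs go to a log-compacted topic
    // with the fingerprint as the key, so Kafka itself deduplicates them.
    class KafkaSchemaStore implements SchemaStore {
      private static final String TOPIC = "schemas";  // placeholder name

      private final Producer<byte[], byte[]> producer;
      private final Map<Long, Schema> cache = new ConcurrentHashMap<>();

      KafkaSchemaStore(Properties producerConfig) {
        // Assumes key/value ByteArraySerializer are set in producerConfig.
        this.producer = new KafkaProducer<>(producerConfig);
      }

      // Publishing is idempotent: writing the same (s', s) pair twice is
      // harmless, and compaction keeps only the latest value for each key.
      public void put(long fingerprint, Schema schema) {
        byte[] key = ByteBuffer.allocate(8).putLong(fingerprint).array();
        byte[] value = schema.toString().getBytes(StandardCharsets.UTF_8);
        producer.send(new ProducerRecord<>(TOPIC, key, value));
      }

      // Readers populate the cache by consuming the topic from the beginning
      // (not shown); parsing the JSON value recovers the schema.
      public Schema get(long fingerprint) {
        return cache.get(fingerprint);
      }

      void addToCache(long fingerprint, String schemaJson) {
        cache.put(fingerprint, new Schema.Parser().parse(schemaJson));
      }
    }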