I'm working on a system that will store Avro-encoded messages in Kafka. The system will have both producers and consumers in different languages, including Ruby (not JRuby) and Java.
At the moment I'm encoding each message as an Avro data file, which means that the full schema is included in every encoded message. This is obviously suboptimal, but there doesn't seem to be a standardized format for single-message Avro encodings. I've looked at Confluent's Schema Registry, but that seems like overkill for my needs and would require me to run and maintain yet another piece of infrastructure. Ideally, I wouldn't have to use anything besides Kafka. Is this something other people have experience with?

I've come up with a scheme that seems to work well independently of what kind of infrastructure you're using: whenever a writer process is asked to encode a message m with schema s for the first time, it broadcasts (s', s) to a schema registry, where s' is the fingerprint of s. The schema registry here is pluggable and can be any mechanism that lets different processes look up schemas. The writer then encodes the message as (s', m), i.e. it only includes the schema fingerprint. A reader, when it first encounters a message with schema fingerprint s', looks up s in the schema registry and uses s to decode the message. The concept of a schema registry is thus abstracted away and isn't tied to "schema ids" or versions. (There's a rough sketch of this in the P.S. below.)

Furthermore, there are some desirable traits:

1. Schemas are identified by their fingerprints, so there's no need for an external system to issue schema ids.

2. Writing (s', s) pairs is idempotent, so there's no need to coordinate that task. If you've got a system with many writers, you can let all of them broadcast their schemas when they boot or whenever they need to encode data with a given schema.

3. It would work with a range of different backends for the schema registry. Simple key-value stores would obviously work, but in my case I'd probably want to use Kafka itself: if the schemas are written to a topic with key-based compaction, where s' is the message key and s is the message value, Kafka will automatically clean up duplicates over time. This would save me from having to add more pieces to my infrastructure.

Has this problem been solved already? If not, would it make sense to define a common "message format" that defines the structure of (s', m) pairs?

Cheers,
Daniel Schierbeck
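P.S. To make the idea a bit more concrete, here's a rough sketch of the writer and reader sides in Java. The SchemaStore interface is just a stand-in for whatever pluggable registry backend you'd plug in, and the 8-byte CRC-64-AVRO fingerprint plus the simple framing of (s', m) are only one possible choice, not a proposal for the final format.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    // Stand-in for the pluggable schema registry: any mechanism that can
    // store and retrieve (fingerprint, schema) pairs will do.
    interface SchemaStore {
      void put(long fingerprint, Schema schema);  // idempotent
      Schema get(long fingerprint);
    }

    class FingerprintedCodec {
      private final SchemaStore store;

      FingerprintedCodec(SchemaStore store) {
        this.store = store;
      }

      // Writer side: broadcast (s', s) to the registry (cheap and idempotent),
      // then encode the message as (s', m) -- an 8-byte fingerprint followed
      // by the plain Avro binary encoding of the record.
      byte[] encode(GenericRecord record) throws IOException {
        Schema schema = record.getSchema();
        long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
        store.put(fingerprint, schema);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(8).putLong(fingerprint).array());
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
      }

      // Reader side: strip off s', look up s in the registry, and use s as
      // the writer's schema when decoding m.
      GenericRecord decode(byte[] message) throws IOException {
        long fingerprint = ByteBuffer.wrap(message, 0, 8).getLong();
        Schema writerSchema = store.get(fingerprint);

        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(message, 8, message.length - 8, null);
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
      }
    }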

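And for the Kafka-backed variant, the registry could just be a producer writing to a log-compacted "schemas" topic keyed by fingerprint, plus a consumer that tails that topic into an in-memory map (the consumer side is omitted below). Again only a rough sketch: it assumes the SchemaStore interface from the previous sketch, a placeholder topic name, byte-array serializers in the producer config, and a topic created with cleanup.policy=compact so duplicate (s', s) pairs get cleaned up over time.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.avro.Schema;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Kafka-backed SchemaStore: (s', s) pairs go to a log-compacted topic
    // with the fingerprint as the key, so Kafka itself deduplicates them.
    class KafkaSchemaStore implements SchemaStore {
      private static final String TOPIC = "schemas";  // placeholder name

      private final Producer<byte[], byte[]> producer;
      private final Map<Long, Schema> cache = new ConcurrentHashMap<>();

      KafkaSchemaStore(Properties producerConfig) {
        // Assumes key/value ByteArraySerializer are set in producerConfig.
        this.producer = new KafkaProducer<>(producerConfig);
      }

      // Publishing is idempotent: writing the same (s', s) pair twice is
      // harmless, and compaction keeps only the latest value for each key.
      public void put(long fingerprint, Schema schema) {
        byte[] key = ByteBuffer.allocate(8).putLong(fingerprint).array();
        byte[] value = schema.toString().getBytes(StandardCharsets.UTF_8);
        producer.send(new ProducerRecord<>(TOPIC, key, value));
      }

      // Readers populate the cache by consuming the topic from the beginning
      // (not shown); parsing the JSON value recovers the schema.
      public Schema get(long fingerprint) {
        return cache.get(fingerprint);
      }

      void addToCache(long fingerprint, String schemaJson) {
        cache.put(fingerprint, new Schema.Parser().parse(schemaJson));
      }
    }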