Hello,

I'm building a streaming data platform using Beam on Dataflow, Pub/Sub, and
Avro for message schemas and serialization.

My plan is to maintain full schema compatibility for messages on the same
topic. But because Avro requires the writer schema in order to deserialize
a payload and resolve it against a compatible reader schema, the encoded
input/output messages need to use Avro's single-object encoding (or an
equivalent mechanism), which prefixes each payload with a schema
fingerprint that can be dereferenced against a schema registry or cache at
runtime to deserialize and convert the payload.
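
For concreteness, here's a minimal round trip using Avro's
org.apache.avro.message API with a toy "User" schema; this
header-plus-fingerprint format is what I mean by single-object encoding:

    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;
    import org.apache.avro.message.BinaryMessageDecoder;
    import org.apache.avro.message.BinaryMessageEncoder;

    public class SingleObjectRoundTrip {
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
          + "[{\"name\":\"name\",\"type\":\"string\"}]}");

      public static void main(String[] args) throws Exception {
        GenericRecord user =
            new GenericRecordBuilder(SCHEMA).set("name", "Patrick").build();

        // encode() prepends a 2-byte marker (0xC3 0x01) and the 8-byte
        // little-endian CRC-64-AVRO fingerprint of the writer schema.
        ByteBuffer payload =
            new BinaryMessageEncoder<GenericRecord>(GenericData.get(), SCHEMA)
                .encode(user);

        // decode() reads the fingerprint, looks up the writer schema
        // (registered via addSchema(), or supplied by a SchemaStore),
        // and resolves it against the reader schema. Here writer and
        // reader are trivially the same schema.
        BinaryMessageDecoder<GenericRecord> decoder =
            new BinaryMessageDecoder<>(GenericData.get(), SCHEMA);
        decoder.addSchema(SCHEMA);
        System.out.println(decoder.decode(payload));
      }
    }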

Does Beam have any built-in support for this pattern? PubsubIO, for
example, uses AvroCoder, which appears to have no support for it (though
it may have in the past, based on some Git archaeology): AvroCoder uses
plain Avro binary encoding, which does not include the single-object
header.
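
The only workaround I see is to bypass the coder: read raw PubsubMessages
and decode the single-object payload in a DoFn. A rough sketch of what I'd
build by hand (the SchemaStore.Cache here is a local stand-in pre-loaded
with known schemas; a real version would fetch unknown fingerprints from a
registry):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.message.BinaryMessageDecoder;
    import org.apache.avro.message.SchemaStore;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;

    public class DecodeSingleObjectFn
        extends DoFn<PubsubMessage, GenericRecord> {
      // The reader schema travels as JSON because Avro's Schema class
      // isn't reliably serializable across Avro versions.
      private final String readerSchemaJson;
      private transient BinaryMessageDecoder<GenericRecord> decoder;

      public DecodeSingleObjectFn(String readerSchemaJson) {
        this.readerSchemaJson = readerSchemaJson;
      }

      @Setup
      public void setup() {
        Schema readerSchema = new Schema.Parser().parse(readerSchemaJson);
        // Stand-in for a registry-backed store: a cache pre-loaded with
        // the writer schemas we already know about.
        SchemaStore.Cache store = new SchemaStore.Cache();
        store.addSchema(readerSchema);
        decoder = new BinaryMessageDecoder<>(
            GenericData.get(), readerSchema, store);
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // decode() resolves the writer schema from the payload's
        // fingerprint and converts the record to the reader schema.
        c.output(decoder.decode(c.element().getPayload()));
      }
    }

Wired up roughly like this ("pipeline", "subscription", and the schema
variables are placeholders):

    PCollection<GenericRecord> records = pipeline
        .apply(PubsubIO.readMessages().fromSubscription(subscription))
        .apply(ParDo.of(new DecodeSingleObjectFn(readerSchemaJson)))
        .setCoder(AvroCoder.of(readerSchema));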

If not, how do other users handle schema changes in their input data
streams? Do you just avoid them altogether, migrating consumers on each
schema change, or do you solve it manually with the above pattern?

Tangentially: what about state schema evolution? I've had trouble finding
any documentation on how to tackle this, whether for AvroCoder-coded state
or for state that uses a Beam Row schema.
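
For clarity, by AvroCoder-coded state I mean something like the following,
where MyStats is a stand-in POJO whose launch-time reflected schema gets
baked into the checkpointed state:

    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    public class CountPerKeyFn extends DoFn<KV<String, Long>, Long> {
      // Stand-in POJO; AvroCoder derives its schema via reflection.
      public static class MyStats {
        long count;
      }

      // Evolving MyStats (say, adding a field) between pipeline updates
      // is the part I can't find documented.
      @StateId("stats")
      private final StateSpec<ValueState<MyStats>> statsSpec =
          StateSpecs.value(AvroCoder.of(MyStats.class));

      @ProcessElement
      public void processElement(
          ProcessContext c, @StateId("stats") ValueState<MyStats> stats) {
        MyStats current = stats.read();
        if (current == null) {
          current = new MyStats();
        }
        current.count += c.element().getValue();
        stats.write(current);
        c.output(current.count);
      }
    }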

Thanks,
Patrick Lucas
