Hi Roger,

Have you considered leveraging Avro logical types, and keeping the payload and
event metadata “separate”?

Here is an example (I will use Avro IDL, since that is more readable to me :-) ):

record MetaData {
  @logicalType("instant") string timeStamp;
  // ... all the metadata fields ...
}

record CloudEvent {
  MetaData metaData;
  Any payload;
}

@logicalType("any")
record Any {
  /** The schema of the data. For efficiency you can use a schema id plus a
      schema repository, or something like
      https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
  string schema;
  bytes data;
}

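On the producing side, you would serialize the payload with its own schema and
wrap the raw bytes in the Any record. A rough Java sketch with the generic API
(the AnyWrapper class and wrap method are just illustrative; you could put a
schema-repo id into the schema field instead of the full schema JSON):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AnyWrapper {
  // Serialize the payload with its own schema, then build the Any record:
  // the payload's schema (or a schema-repo id) goes into "schema", the raw
  // Avro binary goes into "data".
  static GenericRecord wrap(GenericRecord payload, Schema anySchema) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(payload.getSchema()).write(payload, enc);
    enc.flush();
    return new GenericRecordBuilder(anySchema)
        .set("schema", payload.getSchema().toString())
        .set("data", ByteBuffer.wrap(out.toByteArray()))
        .build();
  }
}
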
This way, a system that is interested only in the metadata does not even have
to deserialize the payload.

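For example, a metadata-only consumer could use a reader schema that is the
CloudEvent record with the payload field removed; schema resolution then skips
over the payload bytes entirely. A rough Java sketch (generic API, names
illustrative):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class MetadataOnlyReader {
  // writerSchema: the full CloudEvent schema used by the producer.
  // metadataOnlySchema: the same CloudEvent record with the payload field
  // removed, so resolution skips the payload without decoding it.
  static GenericRecord readMetadata(byte[] message, Schema writerSchema,
                                    Schema metadataOnlySchema) throws IOException {
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<>(writerSchema, metadataOnlySchema);
    return reader.read(null, DecoderFactory.get().binaryDecoder(message, null));
  }
}
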
hope it helps.
—Z
> On Dec 18, 2019, at 11:49 AM, roger peppe <[email protected]> wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the
> CloudEvents specification
> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
> defines standard metadata for events. The proposed format is very generic and
> allows storage of almost any data. It seems to me that by going in that
> direction it loses almost all the advantages of using Avro in the first place.
> It feels like it's trying to shoehorn a dynamic message format like JSON into
> Avro, when using Avro itself could do so much better.
>
> I'm hoping to propose something better. I had what I thought was a nice idea,
> but it doesn't quite work, and I thought I'd bring up the subject here and
> see if anyone had some better ideas.
>
> The schema resolution
> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part of the
> spec allows a reader to read data that was written with a schema containing
> extra fields.
> So, theoretically, we could define a CloudEvent something like this:
>
> {
>     "name": "CloudEvent",
>     "type": "record",
>     "fields": [{
>         "name": "Metadata",
>         "type": {
>             "type": "record",
>             "name": "CloudEvent",
>             "namespace": "avro.apache.org",
>             "fields": [{
>                 "name": "id",
>                 "type": "string"
>             }, {
>                 "name": "source",
>                 "type": "string"
>             }, {
>                 "name": "time",
>                 "type": {
>                     "type": "long",
>                     "logicalType": "timestamp-micros"
>                 }
>             }]
>         }
>     }]
> }
>
> Theoretically, this could enable any event that is a record with at least a
> Metadata field containing the above fields to be read generically. The
> CloudEvent type above could be seen as a structural supertype of all the
> more-specific CloudEvent-compatible records that have such a field.
>
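> For concreteness, a topic-specific record compatible with that supertype might
> look something like this (the TemperatureReading name and the celsius field
> are made up purely for illustration):
> 
> {
>     "name": "TemperatureReading",
>     "type": "record",
>     "fields": [{
>         "name": "Metadata",
>         "type": {
>             "type": "record",
>             "name": "CloudEvent",
>             "namespace": "avro.apache.org",
>             "fields": [{
>                 "name": "id",
>                 "type": "string"
>             }, {
>                 "name": "source",
>                 "type": "string"
>             }, {
>                 "name": "time",
>                 "type": {
>                     "type": "long",
>                     "logicalType": "timestamp-micros"
>                 }
>             }]
>         }
>     }, {
>         "name": "celsius",
>         "type": "double"
>     }]
> }
> 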
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the
> metadata and the payload.
>
> However, this idea fails because of one problem: the schema resolution rule
> that "both schemas are records with the same (unqualified) name". This means
> that unless everyone names all their CloudEvent-compatible records
> "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records "CloudEvent",
> so we have a problem.
>
> I can see a few possible workarounds:
> 1. When reading the record as a CloudEvent, read it with a schema that's the
> same as CloudEvent, but with the top level record name changed to the top
> level name of the schema that was used to write the record.
> 2. Ignore record names when matching schema record types.
> 3. Allow aliases to be specified when writing data as well as reading it. When
> defining a CloudEvent-compatible event, you'd add a CloudEvent alias to your
> record.
> None of the options are particularly nice. Option 1 is probably the easiest to
> do, although it means you'd still need some custom logic when decoding
> records, so you couldn't use stock decoders.
> 
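> For example, a rough sketch of option 1 with the Java generic API (recent Avro
> versions; the class and method names are just for illustration) would be to
> clone the CloudEvent reader schema under the writer's top-level name before
> resolving:
> 
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.DecoderFactory;
> 
> public class CloudEventReader {
>     // Copy the CloudEvent reader schema, giving the top-level record the
>     // writer's name (and namespace) so the "same unqualified name" rule passes.
>     static Schema renamedReader(Schema cloudEventSchema, Schema writerSchema) {
>         Schema renamed = Schema.createRecord(writerSchema.getName(),
>                 cloudEventSchema.getDoc(), writerSchema.getNamespace(), false);
>         List<Schema.Field> fields = new ArrayList<>();
>         for (Schema.Field f : cloudEventSchema.getFields()) {
>             // Field objects can't be reused across schemas, so make copies.
>             fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
>         }
>         renamed.setFields(fields);
>         return renamed;
>     }
> 
>     static GenericRecord readAsCloudEvent(byte[] message, Schema writerSchema,
>                                           Schema cloudEventSchema) throws IOException {
>         GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(
>                 writerSchema, renamedReader(cloudEventSchema, writerSchema));
>         return reader.read(null, DecoderFactory.get().binaryDecoder(message, null));
>     }
> }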
>
> I like the idea of option 2, although it gets a bit tricky when dealing with
> union types. You could define the matching so that it ignores names only when
> the match is unambiguous (i.e. there is only one record in each union). This
> could be implemented as an option ("use structural typing") when decoding.
>
> Option 3 is probably the cleanest, but it interacts significantly with the
> spec (for example, the Parsing Canonical Form transformation strips aliases
> out, but they'd need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better
> way?
>
> cheers,
> rog.
>