Hi Roger,

Have you considered leveraging Avro logical types, and keeping the payload and
event metadata “separate”?

Here is an example (I will use Avro IDL, since that is more readable to me :-) ):

record MetaData {
  @logicalType("instant") string timeStamp;
  // ... all the metadata fields ...
}

record CloudEvent {
  MetaData metaData;
  Any payload;
}

@logicalType("any")
record Any {
  /** The schema of the data. For efficiency you can use a schema id plus a
      schema repository, or something like
      https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
  string schema;
  bytes data;
}

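On the producing side, you would serialize the payload with its own schema and
wrap the raw bytes in the Any record. A rough Java sketch with the generic API
(the AnyWrapper class and wrap method are just illustrative; you could put a
schema-repo id into the schema field instead of the full schema JSON):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AnyWrapper {
  // Serialize the payload with its own schema, then build the Any record:
  // the payload's schema (or a schema-repo id) goes into "schema", the raw
  // Avro binary goes into "data".
  static GenericRecord wrap(GenericRecord payload, Schema anySchema) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(payload.getSchema()).write(payload, enc);
    enc.flush();
    return new GenericRecordBuilder(anySchema)
        .set("schema", payload.getSchema().toString())
        .set("data", ByteBuffer.wrap(out.toByteArray()))
        .build();
  }
}
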
This way, a system that is interested only in the metadata does not even have
to deserialize the payload.

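For example, a metadata-only consumer could use a reader schema that is the
CloudEvent record with the payload field removed; schema resolution then skips
over the payload bytes entirely. A rough Java sketch (generic API, names
illustrative):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class MetadataOnlyReader {
  // writerSchema: the full CloudEvent schema used by the producer.
  // metadataOnlySchema: the same CloudEvent record with the payload field
  // removed, so resolution skips the payload without decoding it.
  static GenericRecord readMetadata(byte[] message, Schema writerSchema,
                                    Schema metadataOnlySchema) throws IOException {
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<>(writerSchema, metadataOnlySchema);
    return reader.read(null, DecoderFactory.get().binaryDecoder(message, null));
  }
}
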
hope it helps.
—Z
> On Dec 18, 2019, at 11:49 AM, roger peppe <[email protected]> wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the
> CloudEvents specification
> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
> defines standard metadata for events. The proposed format is very generic and
> allows storage of almost any data. It seems to me that by going in that
> direction it loses almost all the advantages of using Avro in the first place.
> It feels like it's trying to shoehorn a dynamic message format like JSON into
> Avro, when using Avro itself could do so much better.
>
> I'm hoping to propose something better. I had what I thought was a nice idea,
> but it doesn't quite work, and I thought I'd bring up the subject here and
> see if anyone had some better ideas.
>
> The schema resolution
> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part of the
> spec allows a reader to read data that was written with a schema containing
> extra fields.
> So, theoretically, we could define a CloudEvent something like this:
>
> {
>     "name": "CloudEvent",
>     "type": "record",
>     "fields": [{
>         "name": "Metadata",
>         "type": {
>             "type": "record",
>             "name": "CloudEvent",
>             "namespace": "avro.apache.org",
>             "fields": [{
>                 "name": "id",
>                 "type": "string"
>             }, {
>                 "name": "source",
>                 "type": "string"
>             }, {
>                 "name": "time",
>                 "type": {
>                     "type": "long",
>                     "logicalType": "timestamp-micros"
>                 }
>             }]
>         }
>     }]
> }
>
> Theoretically, this could enable any event that is a record with at least a
> Metadata field containing the above fields to be read generically. The
> CloudEvent type above could be seen as a structural supertype of all the
> more-specific CloudEvent-compatible records that have such a field.
>
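> For concreteness, a topic-specific record compatible with that supertype might
> look something like this (the TemperatureReading name and the celsius field
> are made up purely for illustration):
> 
> {
>     "name": "TemperatureReading",
>     "type": "record",
>     "fields": [{
>         "name": "Metadata",
>         "type": {
>             "type": "record",
>             "name": "CloudEvent",
>             "namespace": "avro.apache.org",
>             "fields": [{
>                 "name": "id",
>                 "type": "string"
>             }, {
>                 "name": "source",
>                 "type": "string"
>             }, {
>                 "name": "time",
>                 "type": {
>                     "type": "long",
>                     "logicalType": "timestamp-micros"
>                 }
>             }]
>         }
>     }, {
>         "name": "celsius",
>         "type": "double"
>     }]
> }
> 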
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the
> metadata and the payload.
>
> However, this idea fails because of one problem: the schema resolution rule
> that "both schemas are records with the same (unqualified) name". This means
> that unless everyone names all their CloudEvent-compatible records
> "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records "CloudEvent",
> so we have a problem.
>
> I can see a few possible workarounds:
> 1. When reading the record as a CloudEvent, read it with a schema that's the
> same as CloudEvent, but with the top level record name changed to the top
> level name of the schema that was used to write the record.
> 2. Ignore record names when matching schema record types.
> 3. Allow aliases to be specified when writing data as well as reading it. When
> defining a CloudEvent-compatible event, you'd add a CloudEvent alias to your
> record.
> None of the options are particularly nice. Option 1 is probably the easiest to
> do, although it means you'd still need some custom logic when decoding
> records, so you couldn't use stock decoders.
> 
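> For example, a rough sketch of option 1 with the Java generic API (recent Avro
> versions; the class and method names are just for illustration) would be to
> clone the CloudEvent reader schema under the writer's top-level name before
> resolving:
> 
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.DecoderFactory;
> 
> public class CloudEventReader {
>     // Copy the CloudEvent reader schema, giving the top-level record the
>     // writer's name (and namespace) so the "same unqualified name" rule passes.
>     static Schema renamedReader(Schema cloudEventSchema, Schema writerSchema) {
>         Schema renamed = Schema.createRecord(writerSchema.getName(),
>                 cloudEventSchema.getDoc(), writerSchema.getNamespace(), false);
>         List<Schema.Field> fields = new ArrayList<>();
>         for (Schema.Field f : cloudEventSchema.getFields()) {
>             // Field objects can't be reused across schemas, so make copies.
>             fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
>         }
>         renamed.setFields(fields);
>         return renamed;
>     }
> 
>     static GenericRecord readAsCloudEvent(byte[] message, Schema writerSchema,
>                                           Schema cloudEventSchema) throws IOException {
>         GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(
>                 writerSchema, renamedReader(cloudEventSchema, writerSchema));
>         return reader.read(null, DecoderFactory.get().binaryDecoder(message, null));
>     }
> }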
>
> I like the idea of option 2, although it gets a bit tricky when dealing with
> union types. You could define the matching so that it ignores names only when
> the match is unambiguous (i.e. there is only one record in each union). This
> could be implemented as an option ("use structural typing") when decoding.
>
> Option 3 is probably the cleanest, but it interacts significantly with the
> spec (for example, the Parsing Canonical Form transformation strips aliases
> out, but they'd need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better
> way?
>
> cheers,
> rog.
>