Thank you Clemens,

This is a very detailed set of proposals, and it looks like it would work.

I do, however, feel we'd need to define a way to handle unions with records.
Your proposal lists various options, of which the discriminator option seems
the most portable to me.

You mention the "displayName" proposal. I don't like that, as it mixes data
with UI elements. The discriminator option can specify a fixed or
configurable field to hold the type of the record.

Kind regards,
Oscar


-- 
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

Op do 18 apr. 2024 10:12 schreef Clemens Vasters via user <
user@avro.apache.org>:

> Hi everyone,
>
>
>
> the current JSON Encoding approach severely limits interoperability with
> other JSON serialization frameworks. In my view, the JSON Encoding is only
> really useful if it acts as a bridge into and from JSON-centric
> applications and it currently gets in its own way.
>
>
>
> The current encoding being what it is, there should be an alternate mode
> that emphasizes interoperability with JSON “as-is” and allows Avro Schema
> to describe existing JSON document instances such that I can take someone’s
> existing JSON document in on one side of a piece of software and emit Avro
> binary on the other side while acting on the same schema.
>
>
>
> There are four specific issues:
>
>
>
>    1. Binary Values
>    2. Unions with Primitive Type Values and Enum Values
>    3. Unions with Record Values
>    4. DateTime
>
>
>
> One by one:
>
>
>
> 1. Binary values:
>
> ---------------------
>
>
>
> Binary values (fixed and bytes) are encoded as escaped Unicode
> literals. While I appreciate the creative trick, it costs 6 bytes for each
> encoded byte. I have a hard time finding any JSON libraries that provide a
> conversion of such strings from/to byte arrays, so this approach appears to
> be idiosyncratic for Avro’s JSON Encoding.
>
>
>
> The common way to encode binary in JSON is to use base64 encoding and that
> is widely and well supported in libraries. Base64 is 33% larger than plain
> bytes, the encoding chosen here is 500% (!) larger than plain bytes.
>
>
>
> The Avro decoder is schema-informed and it knows that a field is expected
> to hold bytes, so it’s easy to mandate base64 for the field content in the
> alternate mode.
>
>
>
> 2. Unions with Primitive Type Values and Enum Values
>
> ---------------------
>
>
>
> It’s common to express optionality in Avro Schema by creating a union with
> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to
> encode such unions, like any union, as { “{type}”: {value} } when the value
> is non-null.
>
>
>
> This choice ignores common practice and the fact that JSON’s values are
> dynamically typed (RFC8259 Section-3
> <https://www.rfc-editor.org/rfc/rfc8259#section-3>) and inherently
> accommodate unions. The conformant way to encode a value choice of null or
> “string” into a JSON value is plainly null and “string”.
>
>
>
> “foo” : null
>
> “foo”: “value”
>
>
>
> The “field default values” table in the Avro spec maps Avro types to the
> JSON types null, boolean, integer, number, string, object, and array, all
> of which can be encoded into and, more importantly, unambiguously decoded
> from a JSON value. The only semi-ambiguous case is integer vs. number,
> which is a convention in JSON rather than a distinct type, but any Avro
> serializer is guided by type information and can easily make that
> distinction.
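>
> A minimal sketch of what schema-informed decoding of ["null", "string"]
> could look like (function name and error handling are illustrative):

```python
import json

# Alternate-mode decoding of an Avro union ["null", "string"]:
# the JSON value itself selects the branch; no wrapper object is needed.
def decode_optional_string(document: str, field: str):
    value = json.loads(document)[field]
    if value is None or isinstance(value, str):
        return value
    raise ValueError(f"{field!r} matches no branch of [\"null\", \"string\"]")

print(decode_optional_string('{"foo": null}', "foo"))     # None
print(decode_optional_string('{"foo": "value"}', "foo"))  # value
```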
>
>
>
> 3. Unions with Record Values
>
> ---------------------
>
>
>
> The JSON Encoding pattern of unions also covers “record” typed values, of
> course, and this is indeed a tricky scenario during deserialization since
> JSON does not have any built-in notion of type hints for “object” typed
> values.
>
>
>
> The problem of having to disambiguate instances of different types in a
> field value is a common one also for users of JSON Schema when using the
> “oneOf” construct, which is equivalent to Avro unions. There are two common
> strategies:
>
>
>
> - “Duck Typing”:  Every conformant JSON Schema Validator determines the
> validity of a JSON node against a “oneOf" rule by testing the instance
> against all available alternative schema definitions. Validation fails if
> there is not exactly one valid match.
>
> - Discriminators: OpenAPI, for instance, mandates a “discriminator” field
> (see https://spec.openapis.org/oas/latest.html#discriminator-object) for
> disambiguating “oneOf” constructs, whereby the discriminator property is
> part of each instance. That approach informs numerous JSON serialization
> frameworks, which implement discriminators under that assumption.
>
>
>
> The Java Jackson library indeed supports the Avro JSON Encoding’s style of
> putting the discriminator into a wrapper field name (JsonTypeInfo
> annotation, JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only
> support the property approach, though, including the two dominant ones for
> .NET, Pydantic for Python, and others. There’s tooling like Redocly that
> flags that approach as a “mistake” (see
> https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object
> ).
>
>
>
> What that means is that most existing JSON instances with ambiguous types
> will either use property discriminators or the implementation will rely on
> duck typing as JSON Schema does for validation. The Avro JSON Encoding
> approach is rare and is also counterintuitive for anyone comparing the
> declared object structure and the JSON structure who is not familiar with
> Avro’s encoding rules. It has confused a lot of people in our house, for
> sure.
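>
> To make the contrast concrete, here are the two shapes side by side (the
> type names and the "$type" key are illustrative):

```python
import json

# Current Avro JSON Encoding: the branch name wraps the instance.
wrapper_style = '{"shape": {"com.example.Circle": {"radius": 2.0}}}'

# Widespread practice (OpenAPI-style discriminator): a property inside the instance.
property_style = '{"shape": {"$type": "Circle", "radius": 2.0}}'

wrapped = json.loads(wrapper_style)["shape"]
branch_name = next(iter(wrapped))   # the type comes from the wrapper key
flat = json.loads(property_style)["shape"]
print(branch_name, flat["$type"])   # com.example.Circle Circle
```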
>
>
>
> Proposed is the following approach:
>
>
>
> a) add a new, optional “const” attribute that can be applied to any record
> field declaration that is of a primitive type. When present, the attribute
> causes the field to always have this value. In Avro binary encoding, the
> field is not transmitted at all, but the decoder yields it with the given
> value. In Avro JSON encoding, the field is emitted and for serialization to
> succeed for the record type, the field must be present with the given value.
>
> b) perform disambiguation of types by the same principle as JSON Schema
> for oneOf, with a performance preference for matching fields flagged with
> “const” against the incoming JSON node. When the deserializer is configured
> by schema to know what fields and values to look for, there should be no
> performance hit compared to the current approach. Deserialization fails
> if there is not exactly one match. That is in line with what
> JSON Schema validation implementations do. JSON Schema also has a “const”
> construct. “Const” or single-valued enums are often used as discriminator
> helpers with JSON Schema’s oneOf.
>
> c) optional: add a new, optional “displayname” attribute that can hold an
> alternate name for the field without the restrictions of the “name”
> character set, so that discriminators like “$type” can be matched. A
> further upside of adding this field is that it can generally be used to
> match international characters in JSON object keys, which are obviously
> permitted there.
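>
> A rough sketch of how (b) could resolve a branch, preferring "const"-flagged
> fields and falling back to duck typing; the schema shapes here are simplified
> for illustration and are not the Avro schema object model:

```python
# Resolve which union branch an incoming JSON object matches.
def resolve_branch(instance: dict, branches: list) -> str:
    matches = []
    for branch in branches:
        consts = {f["name"]: f["const"] for f in branch["fields"] if "const" in f}
        if consts:
            # Preferred path: every const field must be present with its fixed value.
            if all(instance.get(k) == v for k, v in consts.items()):
                matches.append(branch["name"])
        elif {f["name"] for f in branch["fields"]} >= instance.keys():
            matches.append(branch["name"])  # simplified duck typing on field names
    if len(matches) != 1:
        raise ValueError("not exactly one matching union branch")
    return matches[0]

branches = [
    {"name": "Cat", "fields": [{"name": "kind", "type": "string", "const": "cat"},
                               {"name": "lives", "type": "int"}]},
    {"name": "Dog", "fields": [{"name": "kind", "type": "string", "const": "dog"},
                               {"name": "breed", "type": "string"}]},
]
print(resolve_branch({"kind": "dog", "breed": "collie"}, branches))  # Dog
```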
>
>
>
> 4. Date Time
>
> ---------------------
>
>
>
> JSON data generally leans on the RFC3339 profile of ISO8601 for dates and
> durations, not least because JSON Schema defines these choices as
> “format” variants for strings.
>
>
>
> If the incoming type of a field is a string instead of a number, JSON
> deserialization in the alternate mode should interpret the logicalTypes for
> dates as follows.
>
>
>
>    - “date” – RFC3339 5.6 “full-date”
>    - “time-millis” – RFC3339 5.6 “partial-time”
>    - “time-micros” – RFC3339 5.6 “partial-time”
>    - “timestamp-millis” – RFC3339 5.6 “date-time”
>    - “timestamp-micros” – RFC3339 5.6 “date-time”
>    - “local-timestamp-millis” – RFC3339 5.6 “date-time”, ignoring the offset
>    (but see RFC 3339 4.4)
>    - “local-timestamp-micros” – RFC3339 5.6 “date-time”, ignoring the offset
>    (but see RFC 3339 4.4)
>    - “duration” – RFC3339 Appendix A “duration”
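>
> For illustration, these shapes parse with Python's stdlib as-is (note that
> only Python 3.11+ accepts a trailing "Z" offset in fromisoformat; the
> examples below avoid it):

```python
from datetime import date, datetime, time

d = date.fromisoformat("2024-04-18")                      # "date": full-date
t = time.fromisoformat("10:12:00.123")                    # time-of-day: partial-time
ts = datetime.fromisoformat("2024-04-18T10:12:00+00:00")  # timestamp: date-time
local = datetime.fromisoformat("2024-04-18T10:12:00")     # local-timestamp: no offset

print(d, t, ts.tzinfo is not None, local.tzinfo is None)
```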
>
>
>
> The JSON serialization in the alternate mode should have an option, and
> default to, serializing dates as strings. Deserialization parsers MAY be
> lenient and also accept RFC1123 5.2.13 date time strings where RFC3339 5.6
> “date-time” is specified, but I’d make that an implementation choice.
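>
> If an implementation opts into that leniency, the Python stdlib already
> parses RFC1123-style date-times, for example:

```python
from email.utils import parsedate_to_datetime

# Lenient path: accept an RFC1123-style date-time where RFC3339 is expected.
lenient = parsedate_to_datetime("Thu, 18 Apr 2024 10:12:00 GMT")
print(lenient.isoformat())  # 2024-04-18T10:12:00+00:00
```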
>
>
>
>
>
> Best Regards
>
> Clemens Vasters
>
>
>
