Hi Clemens,

I propose we wait a bit to give the community a chance to review
your email and the points you raise.

Then, we will create the Jira accordingly.

Regards
JB

On Thu, Apr 18, 2024 at 9:20 AM Clemens Vasters <cleme...@microsoft.com> wrote:
>
> Hi JB,
>
>
>
> I have not done that yet. I’m happy to break that up into items once I get 
> the sense that this is a direction we can get to a consensus on.
>
>
>
> Shall I file the whole email as a “New Feature” issue first?
>
>
>
> Thanks
>
> Clemens
>
>
>
> From: Jean-Baptiste Onofré <j...@nanthrax.net>
> Sent: Thursday, April 18, 2024 10:17 AM
> To: Clemens Vasters <cleme...@microsoft.com>; user@avro.apache.org
> Subject: Re: Avro JSON Encoding
>
>
>
> Hi Clemens
>
>
>
> Thanks for the detailed email.
>
>
>
> Quick question: have you already created Jira issues for each of the improvements/issues?
>
>
>
> I will take the time to read it asap.
>
>
>
> Thanks
>
> Regards
>
> JB
>
>
>
> On Thu, Apr 18, 2024 at 09:12, Clemens Vasters via user <user@avro.apache.org> 
> wrote:
>
> Hi everyone,
>
>
>
> the current JSON Encoding approach severely limits interoperability with 
> other JSON serialization frameworks. In my view, the JSON Encoding is only 
> really useful if it acts as a bridge into and from JSON-centric applications 
> and it currently gets in its own way.
>
>
>
> The current encoding being what it is, there should be an alternate mode that 
> emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
> describe existing JSON document instances such that I can take someone’s 
> existing JSON document in on one side of a piece of software and emit Avro 
> binary on the other side while acting on the same schema.
>
>
>
> There are four specific issues:
>
>
>
> 1. Binary Values
> 2. Unions with Primitive Type Values and Enum Values
> 3. Unions with Record Values
> 4. DateTime
>
>
>
> One by one:
>
>
>
> 1. Binary values:
>
> ---------------------
>
>
>
> Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
> While I appreciate the creative trick, it costs 6 bytes for each encoded 
> byte. I have a hard time finding any JSON libraries that provide a conversion 
> of such strings from/to byte arrays, so this approach appears to be 
> idiosyncratic for Avro’s JSON Encoding.
>
>
>
> The common way to encode binary in JSON is to use base64 encoding, which is 
> widely and well supported in libraries. Base64 is 33% larger than plain 
> bytes; the encoding chosen here is 500% (!) larger than plain bytes.
>
>
>
> The Avro decoder is schema-informed and it knows that a field is expected to 
> hold bytes, so it’s easy to mandate base64 for the field content in the 
> alternate mode.
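
To make the size comparison concrete, here is a minimal Python sketch. It assumes every byte is written as a \u00XX escape, as described above; an encoder that leaves printable ASCII unescaped would do better on text-like data. The function names are illustrative only and are not part of any Avro API.

```python
import base64


def escaped_literal(data: bytes) -> str:
    """Write every byte as a \\u00XX escape (assumed worst case)."""
    return "".join(f"\\u{b:04x}" for b in data)


def base64_literal(data: bytes) -> str:
    """Standard base64 text for the same bytes."""
    return base64.b64encode(data).decode("ascii")


payload = bytes(range(256)) * 3  # 768 arbitrary bytes
escaped = escaped_literal(payload)
b64 = base64_literal(payload)

# Characters of JSON string content per input byte (quotes excluded).
print(len(escaped) / len(payload))
print(len(b64) / len(payload))
```

Because the decoder already knows from the schema that the field holds bytes, it could apply base64 decoding to the string without any extra marker in the JSON.
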
>
>
>
> 2. Unions with Primitive Type Values and Enum Values
>
> ---------------------
>
>
>
> It’s common to express optionality in Avro Schema by creating a union with 
> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to 
> encode such unions, like any union, as { “{type}”: {value} } when the value 
> is non-null.
>
>
>
> This choice ignores common practice and the fact that JSON’s values are 
> dynamically typed (RFC8259 Section-3) and inherently accommodate unions. The 
> conformant way to encode a value choice of null or “string” into a JSON value 
> is plainly null and “string”.
>
>
>
> “foo” : null
>
> “foo”: “value”
>
>
>
> The “field default values” table in the Avro spec maps Avro types to the JSON 
> types null, boolean, integer, number, string, object, and array, all of which 
> can be encoded into and, more importantly, unambiguously decoded from a JSON 
> value. The only semi-ambiguous case is integer vs. number, which is a 
> convention in JSON rather than a distinct type, but any Avro serializer is 
> guided by type information and can easily make that distinction.
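
A rough Python sketch of how a schema-guided decoder might pick the union branch for a plain JSON value. It is an illustration under simplifying assumptions, not the Avro implementation: only primitive branches are handled, "bytes" and enums are left out, and pick_branch is a hypothetical name.

```python
import json


def pick_branch(value, branches):
    """Return the first primitive union branch that the JSON value fits."""
    is_bool = isinstance(value, bool)  # bool is a subclass of int in Python
    for branch in branches:
        if branch == "null" and value is None:
            return branch
        if branch == "boolean" and is_bool:
            return branch
        if branch in ("int", "long") and isinstance(value, int) and not is_bool:
            return branch
        if branch in ("float", "double") and isinstance(value, (int, float)) and not is_bool:
            return branch
        if branch == "string" and isinstance(value, str):
            return branch
    raise ValueError(f"no union branch matches {value!r}")


union = ["string", "null"]
for text in ('{"foo": null}', '{"foo": "value"}'):
    doc = json.loads(text)
    print(text, "->", pick_branch(doc["foo"], union))
```

The integer vs. number case is resolved by branch order here: a JSON integer fits "long" if that branch is listed first, and otherwise falls through to "double".
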
>
>
>
> 3. Unions with Record Values
>
> ---------------------
>
>
>
> The JSON Encoding pattern of unions also covers “record” typed values, of 
> course, and this is indeed a tricky scenario during deserialization since 
> JSON does not have any built-in notion of type hints for “object” typed 
> values.
>
>
>
> The problem of having to disambiguate instances of different types in a field 
> value is a common one also for users of JSON Schema when using the “oneOf” 
> construct, which is equivalent to Avro unions. There are two common 
> strategies:
>
>
>
> - “Duck Typing”:  Every conformant JSON Schema Validator determines the 
> validity of a JSON node against a “oneOf” rule by testing the instance 
> against all available alternative schema definitions. Validation fails if 
> there is not exactly one valid match.
>
> - Discriminators: OpenAPI, for instance, mandates a “discriminator” field 
> (see https://spec.openapis.org/oas/latest.html#discriminator-object) for 
> disambiguating “oneOf” constructs, whereby the discriminator property is part 
> of each instance. That approach informs numerous JSON serialization 
> frameworks, which implement discriminators under that assumption.
>
>
>
> The Java Jackson library indeed supports the Avro JSON Encoding’s style of 
> putting the discriminator into a wrapper field name (JsonTypeInfo annotation, 
> JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only support the 
> property approach, though, including the two dominant ones for .NET, Pydantic 
> for Python, and others. There’s tooling like Redocly that flags the wrapper approach 
> as a “mistake” (see 
> https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object).
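
For readers who have not seen the two shapes side by side, a small Python sketch that converts between them. The discriminator key "type" and both function names are assumptions chosen for illustration; real frameworks let you configure the property name.

```python
def wrapper_to_property(node, discriminator="type"):
    """{"Circle": {"radius": 1.0}} -> {"type": "Circle", "radius": 1.0}"""
    if len(node) != 1:
        raise ValueError("wrapper object must have exactly one key")
    (type_name, body), = node.items()
    return {discriminator: type_name, **body}


def property_to_wrapper(node, discriminator="type"):
    """{"type": "Circle", "radius": 1.0} -> {"Circle": {"radius": 1.0}}"""
    body = dict(node)  # copy so the caller's object is not modified
    type_name = body.pop(discriminator)
    return {type_name: body}


wrapped = {"Circle": {"radius": 1.0}}  # Avro JSON Encoding style
flat = wrapper_to_property(wrapped)    # property-discriminator style
print(flat)
print(property_to_wrapper(flat))
```
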
>
>
>
> What that means is that most existing JSON instances with ambiguous types 
> will either use property discriminators or the implementation will rely on 
> duck typing as JSON Schema does for validation. The Avro JSON Encoding 
> approach is rare and is also counterintuitive for anyone comparing the 
> declared object structure and the JSON structure who is not familiar with 
> Avro’s encoding rules. It has confused a lot of people in our house, for sure.
>
>
>
> Proposed is the following approach:
>
>
>
> a) add a new, optional “const” attribute that can be applied to any record 
> field declaration that is of a primitive type. When present, the attribute 
> causes the field to always have this value. In Avro binary encoding, the 
> field is not transmitted, at all, but the decoder yields it with the given 
> value. In Avro JSON encoding, the field is emitted and for serialization to 
> succeed for the record type, the field must be present with the given value.
>
> b) perform disambiguation of types by the same principle as JSON Schema for 
> oneOf, with a performance preference for matching fields flagged with “const” 
> against the incoming JSON node. When the deserializer is configured by schema 
> to know what fields and values to look for, there should be no 
> performance hit compared to the current approach. Deserialization fails if 
> there is not one unambiguous match. That is exactly in line with what JSON 
> Schema validation implementations do. JSON Schema also has a “const” 
> construct. “Const” or single-valued enums are often used as discriminator 
> helpers with JSON Schema’s oneOf.
>
> c) optional: add a new, optional “displayname” attribute that can hold an 
> alternate name for the field without the restrictions of the “name” character 
> set, so that discriminators like “$type” can be matched. A further upside of 
> adding this field is that it can generally be used to match international 
> characters in JSON object keys, which are obviously permitted there.
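
A minimal Python sketch of steps (a) and (b). The “const” attribute is the proposed extension, not something current Avro schemas support, and the record definitions and function names are made up for the example. For brevity the check only tests field presence and const values, not the value types or defaults.

```python
CIRCLE = {
    "type": "record",
    "name": "Circle",
    "fields": [
        {"name": "kind", "type": "string", "const": "circle"},  # proposed attribute
        {"name": "radius", "type": "double"},
    ],
}
SQUARE = {
    "type": "record",
    "name": "Square",
    "fields": [
        {"name": "kind", "type": "string", "const": "square"},
        {"name": "side", "type": "double"},
    ],
}


def matches(record_schema, node):
    """True if the JSON object could be an instance of the record schema."""
    if not isinstance(node, dict):
        return False
    fields = record_schema["fields"]
    # Check const-flagged fields first: they reject most non-matches cheaply.
    for field in fields:
        if "const" in field and node.get(field["name"]) != field["const"]:
            return False
    return all(field["name"] in node for field in fields)


def resolve_record(union, node):
    """Return the name of the single matching record, or fail."""
    candidates = [schema["name"] for schema in union if matches(schema, node)]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one match, got {candidates}")
    return candidates[0]


print(resolve_record([CIRCLE, SQUARE], {"kind": "square", "side": 2.0}))
```
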
>
>
>
> 4. Date Time
>
> ---------------------
>
>
>
> JSON data generally leans on the RFC3339 profile of ISO8601 for dates and 
> durations, not least because JSON Schema defines these choices as “format” 
> variants for strings.
>
>
>
> If the incoming type of a field is a string instead of a number, JSON 
> deserialization in the alternate mode should interpret the logicalTypes for 
> dates as follows.
>
>
>
> “date” – RFC3339 5.6 “full-date”
> “time-millis” – RFC3339 5.6 “partial-time”
> “time-micros” – RFC3339 5.6 “partial-time”
> “timestamp-millis” – RFC3339 5.6 “date-time”
> “timestamp-micros” – RFC3339 5.6 “date-time”
> “local-timestamp-millis” – RFC3339 5.6 “date-time”, ignoring offset (but see 
> RFC 3339 4.4)
> “local-timestamp-micros” – RFC3339 5.6 “date-time”, ignoring offset (but see 
> RFC 3339 4.4)
> “duration” – RFC3339 Appendix A “duration”
>
>
>
> The JSON serialization in the alternate mode should offer an option for, and 
> default to, serializing dates as strings. Deserialization parsers MAY be 
> lenient and also accept RFC1123 5.2.13 date time strings where RFC3339 5.6 
> “date-time” is specified, but I’d make that an implementation choice.
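
A sketch of what the string-to-logical-type conversion could look like in Python, using only the standard library. It covers four of the logical types listed above, the function names are hypothetical, and it leans on datetime.fromisoformat, which accepts only a subset of RFC 3339 on older Python versions (hence the explicit handling of a trailing "Z").

```python
from datetime import date, datetime, time, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_LOCAL = datetime(1970, 1, 1)


def _parse_date_time(text: str) -> datetime:
    """Parse an RFC 3339 date-time; an offset (or Z) is mandatory."""
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        raise ValueError("RFC 3339 date-time requires an offset")
    return parsed


def parse_date(text: str) -> int:
    """“date”: days since 1970-01-01 from a full-date string."""
    return (date.fromisoformat(text) - EPOCH_DATE).days


def parse_time_micros(text: str) -> int:
    """“time-micros”: microseconds after midnight from a partial-time string."""
    t = time.fromisoformat(text)
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


def parse_timestamp_millis(text: str) -> int:
    """“timestamp-millis”: milliseconds since the epoch, in UTC."""
    return (_parse_date_time(text) - EPOCH_UTC) // timedelta(milliseconds=1)


def parse_local_timestamp_millis(text: str) -> int:
    """“local-timestamp-millis”: wall-clock reading, offset ignored."""
    local = _parse_date_time(text).replace(tzinfo=None)
    return (local - EPOCH_LOCAL) // timedelta(milliseconds=1)


print(parse_date("2024-04-18"))
print(parse_timestamp_millis("2024-04-18T09:12:00Z"))
```
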
>
>
>
>
>
> Best Regards
>
> Clemens Vasters
>
>
