Hi Clemens

Yeah it makes sense. I will be back from the US next week so I will have
time to work with you on that and also move forward on the releases.

Thanks !
Regards
JB

On Wed, Apr 24, 2024 at 07:03, Clemens Vasters <cleme...@microsoft.com>
wrote:

> Hi JB,
>
> since there seems to be interest in the group even if not full consensus
> on the scope, I propose that I open an umbrella issue on this with more
> specific focus on the "what"/"how" more than the "why" as I did in the
> opening email, which can then be broken down into individual feature
> issues. I can work on that early next week.
>
> Best Regards
> Clemens
>
> ------------------------------
> *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> *Sent:* Thursday, April 18, 2024 10:58 AM
> *To:* Clemens Vasters <cleme...@microsoft.com>
> *Cc:* Jean-Baptiste Onofré <j...@nanthrax.net>; user@avro.apache.org <
> user@avro.apache.org>
> *Subject:* Re: Avro JSON Encoding
>
> Hi Clemens,
>
> I propose to wait a bit to give a chance to the community to review
> your email and points.
>
> Then, we will create the Jira accordingly.
>
> Regards
> JB
>
> On Thu, Apr 18, 2024 at 9:20 AM Clemens Vasters <cleme...@microsoft.com>
> wrote:
> >
> > Hi JB,
> >
> >
> >
> > I have not done that yet. I’m happy to break that up into items once I
> get the sense that this is a direction we can get to a consensus on.
> >
> >
> >
> > Shall I file the whole email as a “New Feature” issue first?
> >
> >
> >
> > Thanks
> >
> > Clemens
> >
> >
> >
> > From: Jean-Baptiste Onofré <j...@nanthrax.net>
> > Sent: Thursday, April 18, 2024 10:17 AM
> > To: Clemens Vasters <cleme...@microsoft.com>; user@avro.apache.org
> > Subject: Re: Avro JSON Encoding
> >
> >
> >
> > Hi Clemens
> >
> >
> >
> > Thanks for the detailed email.
> >
> >
> >
> > Quick question: did you already create Jira issues for each of the
> improvements/issues?
> >
> >
> >
> > I will take the time to read asap.
> >
> >
> >
> > Thanks
> >
> > Regards
> >
> > JB
> >
> >
> >
> > On Thu, Apr 18, 2024 at 09:12, Clemens Vasters via user <
> user@avro.apache.org> wrote:
> >
> > Hi everyone,
> >
> >
> >
> > the current JSON Encoding approach severely limits interoperability with
> other JSON serialization frameworks. In my view, the JSON Encoding is only
> really useful if it acts as a bridge into and from JSON-centric
> applications and it currently gets in its own way.
> >
> >
> >
> > The current encoding being what it is, there should be an alternate mode
> that emphasizes interoperability with JSON “as-is” and allows Avro Schema
> to describe existing JSON document instances such that I can take someone’s
> existing JSON document in on one side of a piece of software and emit Avro
> binary on the other side while acting on the same schema.
> >
> >
> >
> > There are four specific issues:
> >
> >
> >
> > 1. Binary Values
> > 2. Unions with Primitive Type Values and Enum Values
> > 3. Unions with Record Values
> > 4. DateTime
> >
> >
> >
> > One by one:
> >
> >
> >
> > 1. Binary values:
> >
> > ---------------------
> >
> >
> >
> > Binary values (fixed and bytes) are encoded as escaped unicode
> literals. While I appreciate the creative trick, it costs 6 bytes for each
> encoded byte. I have a hard time finding any JSON libraries that provide a
> conversion of such strings from/to byte arrays, so this approach appears to
> be idiosyncratic for Avro’s JSON Encoding.
> >
> >
> >
> > The common way to encode binary in JSON is to use base64 encoding and
> that is widely and well supported in libraries. Base64 is 33% larger than
> plain bytes; the encoding chosen here is 500% (!) larger than plain bytes.
> >
> >
> >
> > The Avro decoder is schema-informed and it knows that a field is
> expected to hold bytes, so it’s easy to mandate base64 for the field
> content in the alternate mode.
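To make the size argument concrete, here is a minimal Python sketch. It assumes the current encoding maps each byte to the Unicode code point of the same value and escapes everything outside ASCII as \u00XX; the helper names are made up for illustration and are not part of any Avro library. The payload is deliberately the worst case (no printable ASCII bytes).

```python
import base64
import json


def avro_json_bytes(data: bytes) -> str:
    """Sketch of the current style: one code point per byte, escaped."""
    # Each byte becomes the code point U+0000..U+00FF (ISO-8859-1).
    text = data.decode("iso-8859-1")
    # ensure_ascii=True escapes every non-ASCII character as \u00XX.
    return json.dumps(text, ensure_ascii=True)


def base64_json_bytes(data: bytes) -> str:
    """Sketch of the proposed alternate mode: a base64 JSON string."""
    return json.dumps(base64.b64encode(data).decode("ascii"))


# 128 bytes in the range 0x80..0xFF, so every byte needs an escape.
payload = bytes(range(128, 256))
escaped = avro_json_bytes(payload)
b64 = base64_json_bytes(payload)

print(len(payload), len(escaped), len(b64))

# Both forms round-trip, but only base64 is widely supported by libraries.
assert json.loads(escaped).encode("iso-8859-1") == payload
assert base64.b64decode(json.loads(b64)) == payload
```

Because the decoder is schema-informed, it can pick the base64 path whenever the field type is bytes or fixed, without any marker in the JSON itself.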
> >
> >
> >
> > 2. Unions with Primitive Type Values and Enum Values
> >
> > ---------------------
> >
> >
> >
> > It’s common to express optionality in Avro Schema by creating a union
> with the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts
> to encode such unions, like any union, as { “{type}”: {value} } when the
> value is non-null.
> >
> >
> >
> > This choice ignores common practice and the fact that JSON’s values are
> dynamically typed (RFC8259 Section-3) and inherently accommodate unions.
> The conformant way to encode a value choice of null or “string” into a JSON
> value is plainly null and “string”.
> >
> >
> >
> > "foo": null
> >
> > "foo": "value"
> >
> >
> >
> > The “field default values” table in the Avro spec maps Avro types to the
> JSON types null, boolean, integer, number, string, object, and array, all
> of which can be encoded into and, more importantly, unambiguously decoded
> from a JSON value. The only semi-ambiguous case is integer vs. number,
> which is a convention in JSON rather than a distinct type, but any Avro
> serializer is guided by type information and can easily make that
> distinction.
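A minimal Python sketch of how a schema-guided decoder could pick the union branch from a plain JSON value, without the { "{type}": {value} } wrapper. The function names and the first-match rule are assumptions for illustration only; enums and the complex types are left out.

```python
import json


def matches(avro_type: str, value) -> bool:
    """Check whether a parsed JSON value fits an Avro primitive type."""
    if avro_type == "null":
        return value is None
    if avro_type == "boolean":
        return isinstance(value, bool)
    if avro_type in ("int", "long"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if avro_type in ("float", "double"):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if avro_type == "string":
        return isinstance(value, str)
    return False


def decode_union(branches, value):
    """Return (branch, value) for the first branch the value fits."""
    for branch in branches:
        if matches(branch, value):
            return branch, value
    raise ValueError(f"{value!r} matches none of {branches}")


optional_string = ["null", "string"]
doc_null = json.loads('{"foo": null}')
doc_text = json.loads('{"foo": "value"}')

print(decode_union(optional_string, doc_null["foo"]))
print(decode_union(optional_string, doc_text["foo"]))
# The schema order resolves the integer vs. number convention.
print(decode_union(["long", "double"], json.loads("3")))
print(decode_union(["long", "double"], json.loads("3.5")))
```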
> >
> >
> >
> > 3. Unions with Record Values
> >
> > ---------------------
> >
> >
> >
> > The JSON Encoding pattern of unions also covers “record” typed values,
> of course, and this is indeed a tricky scenario during deserialization
> since JSON does not have any built-in notion of type hints for “object”
> typed values.
> >
> >
> >
> > The problem of having to disambiguate instances of different types in a
> field value is a common one also for users of JSON Schema when using the
> “oneOf” construct, which is equivalent to Avro unions. There are two common
> strategies:
> >
> >
> >
> > - “Duck Typing”:  Every conformant JSON Schema Validator determines the
> validity of a JSON node against a “oneOf” rule by testing the instance
> against all available alternative schema definitions. Validation fails if
> there is not exactly one valid match.
> >
> > - Discriminators: OpenAPI, for instance, mandates a “discriminator”
> field (see
> https://spec.openapis.org/oas/latest.html#discriminator-object) for
> disambiguating “oneOf” constructs, whereby the discriminator property is
> part of each instance. That approach informs numerous JSON serialization
> frameworks, which implement discriminators under that assumption.
> >
> >
> >
> > The Java Jackson library indeed supports the Avro JSON Encoding’s style
> of putting the discriminator into a wrapper field name (JsonTypeInfo
> annotation, JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only
> support the property approach, though, including the two dominant ones for
> .NET, Pydantic of Python, and others. There’s tooling like Redocly that
> flags that approach as a “mistake” (see
> https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object).
>
> >
> >
> >
> > What that means is that most existing JSON instances with ambiguous
> types will either use property discriminators or the implementation will
> rely on duck typing as JSON Schema does for validation. The Avro JSON
> Encoding approach is rare and is also counterintuitive for anyone comparing
> the declared object structure and the JSON structure who is not familiar
> with Avro’s encoding rules. It has confused a lot of people in our house,
> for sure.
> >
> >
> >
> > Proposed is the following approach:
> >
> >
> >
> > a) add a new, optional “const” attribute that can be applied to any
> record field declaration that is of a primitive type. When present, the
> attribute causes the field to always have this value. In Avro binary
> encoding, the field is not transmitted, at all, but the decoder yields it
> with the given value. In Avro JSON encoding, the field is emitted and for
> serialization to succeed for the record type, the field must be present
> with the given value.
> >
> > b) perform disambiguation of types by the same principle as JSON Schema
> for oneOf, with a performance preference for matching fields flagged with
> “const” against the incoming JSON node. When the deserializer is configured
> by schema to know what fields and values to look for, there should be
> no performance hit compared to the current approach. Deserialization fails
> if there is not one unambiguous match. That is exactly in line with what
> JSON Schema validation implementations do. JSON Schema also has a “const”
> construct. “Const” or single-valued enums are often used as discriminator
> helpers with JSON Schema’s oneOf.
> >
> > c) optional: add a new, optional “displayname” attribute that can hold
> an alternate name for the field without the restrictions of the “name”
> character set, so that discriminators like “$type” can be matched. A
> further upside of adding this field is that it can generally be used to
> match international characters in JSON object keys, which are obviously
> permitted there.
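A rough Python sketch of points a) and b): records in a union are matched against the incoming JSON node, checking the proposed "const" fields first and requiring exactly one match. The "const" attribute is the proposal from this email, not an existing Avro schema feature, and the schemas and function names below are hypothetical.

```python
import json

# Two hypothetical record schemas using the proposed "const" attribute.
CIRCLE = {
    "type": "record",
    "name": "Circle",
    "fields": [
        {"name": "kind", "type": "string", "const": "circle"},
        {"name": "radius", "type": "double"},
    ],
}
SQUARE = {
    "type": "record",
    "name": "Square",
    "fields": [
        {"name": "kind", "type": "string", "const": "square"},
        {"name": "side", "type": "double"},
    ],
}
UNION = [CIRCLE, SQUARE]


def const_fields_match(schema, node) -> bool:
    """Cheap first test: every const field must carry its fixed value."""
    for field in schema["fields"]:
        if "const" in field and node.get(field["name"]) != field["const"]:
            return False
    return True


def all_fields_present(schema, node) -> bool:
    """Duck-typing test: every declared field must exist in the node."""
    return all(field["name"] in node for field in schema["fields"])


def resolve_record(union, node):
    """Return the single matching record schema, or fail."""
    candidates = [
        schema
        for schema in union
        if const_fields_match(schema, node) and all_fields_present(schema, node)
    ]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one match, got {len(candidates)}")
    return candidates[0]


node = json.loads('{"kind": "square", "side": 2.0}')
print(resolve_record(UNION, node)["name"])
```

A real implementation would also validate field types and defaults; the sketch only shows the "exactly one unambiguous match" rule.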
> >
> >
> >
> > 4. Date Time
> >
> > ---------------------
> >
> >
> >
> > JSON data generally leans on the RFC3339 profile of ISO8601 for dates
> and durations, not least because JSON Schema defines these choices as
> “format” variants for strings.
> >
> >
> >
> > If the incoming type of a field is a string instead of a number, JSON
> deserialization in the alternate mode should interpret the logicalTypes for
> dates as follows.
> >
> >
> >
> > “date” – RFC3339 5.6 “full-date”
> > “time-millis” – RFC3339 5.6 “partial-time”
> > “time-micros” – RFC3339 5.6 “partial-time”
> > “timestamp-millis” – RFC3339 5.6 “date-time”
> > “timestamp-micros” – RFC3339 5.6 “date-time”
> > “local-timestamp-millis” – RFC3339 5.6 “date-time”, ignoring offset (but
> see RFC 3339 4.4)
> > “local-timestamp-micros” – RFC3339 5.6 “date-time”, ignoring offset (but
> see RFC 3339 4.4)
> > “duration” – RFC3339 Appendix A “duration”
> >
> >
> >
> > The JSON serialization in the alternate mode should offer, and default
> to, serializing dates as strings. Deserialization parsers MAY be
> lenient and also accept RFC1123 5.2.14 date time strings where RFC3339 5.6
> “date-time” is specified, but I’d make that an implementation choice.
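A minimal Python sketch of the string-to-logical-type mapping for three of the types above, using only the standard library. The function names are hypothetical, and the parsing relies on datetime's ISO 8601 support, which covers common RFC 3339 strings but is not a full RFC 3339 parser.

```python
from datetime import date, datetime, time, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_date(text: str) -> int:
    """RFC 3339 full-date -> Avro "date" (days since the epoch)."""
    return (date.fromisoformat(text) - EPOCH_DATE).days


def parse_time_millis(text: str) -> int:
    """RFC 3339 partial-time -> Avro "time-millis" (ms after midnight)."""
    t = time.fromisoformat(text)
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1000 + t.microsecond // 1000


def parse_timestamp_millis(text: str) -> int:
    """RFC 3339 date-time -> Avro "timestamp-millis" (ms since epoch, UTC)."""
    # Older Python versions do not accept the "Z" suffix in fromisoformat.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        raise ValueError("RFC 3339 date-time requires an offset")
    delta = dt - EPOCH
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


print(parse_date("2024-04-18"))
print(parse_time_millis("10:58:00.250"))
print(parse_timestamp_millis("2024-04-18T09:12:00+02:00"))
```

For the local-timestamp variants, the same parse could be used with the offset dropped before computing the value, as the list above suggests.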
> >
> >
> >
> >
> >
> > Best Regards
> >
> > Clemens Vasters
> >
> >
>
