Hi,

Using a JSON encoding to bridge Avro to/from JSON is indeed a good idea.

But the systems still need to talk the same “language” (data structure). I 
rarely encounter systems that allow fully free-form objects (and never in 
production); there’s always some data structure behind it. This data structure 
(dict/struct/record/…) limits what can be transferred, and in at least 19 out 
of 20 cases covers records (with required and optional fields), arrays and 
primitives.

Where I do see more complex data structures, such as those using the more 
advanced XML features (like mixing namespaces) or free-form JSON options (like 
mixing fixed properties with patternProperties/additionalProperties), they are 
generally tied so closely to that format that bridging to Avro makes little to 
no sense.

That doesn’t mean there is no use case, but it does mean you’re better served 
by a dedicated parser that emits Avro records, along the lines of what I’ve 
dabbled with here: https://github.com/opwvhk/avro-conversions


Kind regards,
Oscar

-- 
Oscar Westra van Holthe - Kind <opw...@apache.org>

> On 23 Apr 2024, at 15:40, Clemens Vasters via user <user@avro.apache.org> 
> wrote:
> 
> I don't think you get around either maps or unions for a model where Avro 
> Schema can describe a JSON originating from an existing producer that isn't 
> aware of Avro Schema being used by the consumer. That is the test I would 
> apply for whether the encoder (or decoder in this case) is practically 
> useful. Avro Binary sufficiently covers the scenario where both parties are 
> known to be implemented with Avro. JSON is primarily useful as a bridge 
> to/from producers and consumers which do not use Avro bits and thus likely 
> not Avro Schema.
> From: Oscar Westra van Holthe - Kind <opw...@apache.org>
> Sent: Tuesday, April 23, 2024 1:10:16 PM
> To: user@avro.apache.org <user@avro.apache.org>
> Subject: Re: Avro JSON Encoding
>  
> Hi everyone,
> 
> Having looked a bit more into what I usually see when using JSON to transfer 
> data, I think we can limit cross-format support (what this essentially is) to 
> a common denominator as we can see between Python objects / dicts, Rust 
> structs, Java POJOs/records, and Parquet MessageTypes, just to name a few.
> 
> This essentially boils down to all Avro constructs except for maps and unions 
> other than a single type plus null (i.e., the recent ‘?’ addition to the IDL 
> syntax). It also means we can omit support for most of the esoteric JSON 
> schema constructs, like additionalProperties, patternProperties, 
> if/then/else, etc.
> 
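A minimal sketch of that common denominator, written as a check over a schema that has already been parsed from JSON. The function name `in_common_subset` is made up for illustration and is not part of any Avro SDK; references to named types are simply accepted as-is.

```python
def in_common_subset(schema):
    """Return True if the schema avoids maps and any union other than [null, T]."""
    if isinstance(schema, str):
        # Primitive name, or a reference to a named type defined elsewhere.
        return True
    if isinstance(schema, list):
        # Union: only "a single type plus null" is allowed.
        non_null = [branch for branch in schema if branch != "null"]
        return (
            len(schema) == 2
            and len(non_null) == 1
            and in_common_subset(non_null[0])
        )
    kind = schema["type"]
    if kind == "map":
        return False
    if kind == "record":
        return all(in_common_subset(field["type"]) for field in schema["fields"])
    if kind == "array":
        return in_common_subset(schema["items"])
    if kind in ("enum", "fixed"):
        return True
    # Primitive written in object form (e.g. with a logicalType), or a nested schema.
    return in_common_subset(kind)
```

A record with optional fields and arrays passes; a map, or a union such as ["int", "string"], does not.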
> 
> However, as Ryan noted, it still makes sense to find a way to promote the 
> Avro binary format, and I’d add especially the single-message encoding: most 
> requests to use JSON, for example, concern single records. Currently, 
> however, only the Rust and Java SDKs mention the byte marker for the 
> single-message encoding at all. It’s very much lacking from Python.
> 
> In fact, if we want to promote the use of Avro (and especially its binary 
> format), we must have better documentation and implementation of the 
> single-message encoding.
> 
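For reference, a rough sketch of the framing as the Avro specification describes it: the two-byte marker C3 01, then the 8-byte little-endian CRC-64-AVRO fingerprint of the writer schema, then the binary-encoded datum. The helper names `crc64_avro` and `single_object_frame` are illustrative only, and the caller is assumed to supply the schema's Parsing Canonical Form and an already-encoded Avro binary body.

```python
import struct

EMPTY = 0xC15D213AA4D7A795  # seed value of the CRC-64-AVRO fingerprint


def _build_table():
    """Precompute the 256-entry lookup table used by the fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table


_TABLE = _build_table()


def crc64_avro(data: bytes) -> int:
    """64-bit Rabin fingerprint (CRC-64-AVRO) of the given bytes."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ byte) & 0xFF]
    return fp


def single_object_frame(canonical_schema: str, avro_binary: bytes) -> bytes:
    """Prefix an Avro binary datum with the marker and schema fingerprint."""
    fingerprint = crc64_avro(canonical_schema.encode("utf-8"))
    return b"\xc3\x01" + struct.pack("<Q", fingerprint) + avro_binary


# The int 1 is zig-zag encoded as the single byte 0x02.
frame = single_object_frame('"int"', b"\x02")
```

The overhead is a fixed 10 bytes per message, and a reader can use the fingerprint to look up the writer schema.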
> 
> Kind regards,
> Oscar
> 
> -- 
> Oscar Westra van Holthe - Kind <opw...@apache.org>
> 
>> On 19 Apr 2024, at 23:45, Andrew Otto <o...@wikimedia.org> wrote:
>> 
>> > There's probably a nice balance between a rigorous and interoperable (but 
>> > less customizable) JSON encoding, and trying to accommodate arbitrary JSON 
>> > in the Avro project.
>> 
>> For my own purposes, I'd only need a very limited set of JSON support. For 
>> event streaming, we limit JSONSchema usages to those that can be easily and 
>> explicitly mapped to SQL (Hive, Spark, Flink) type systems. e.g. No 
>> undefined additionalProperties 
>> <https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#No_object_additionalProperties>,
>>  no union types 
>> <https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#No_union_types_/_No_null_values>,
>>  etc. etc.  
>> 
>> On Fri, Apr 19, 2024 at 11:58 AM Ryan Skraba <r...@skraba.com> wrote:
>> Hello!
>> 
>> A bit tongue in cheek: the one advantage of the current Avro JSON
>> encoding is that it drives users rapidly to prefer the binary
>> encoding!  In its current state, Avro isn't really a satisfactory
>> toolkit for JSON interoperability, while it shines for binary
>> interoperability. Using JSON with Avro schemas is pretty unwieldy and
>> a JSON data designer will almost never be entirely satisfied with the
>> JSON "shape" they can get... today it's useful for testing and
>> debugging.
>> 
>> That being said, it's hard to argue with improving this experience
>> where it can help developers that really want to use Avro JSON for
>> data transfer, especially for things accepting JSON where the
>> intention is clearly unambiguous or allowing optional attributes to be
>> missing.  I'd be enthusiastic to see some of these improvements,
>> especially if we keep the possibility of generating strict Avro JSON
>> for forwards and backwards compatibility.
>> 
>> My preference would be to avoid adding JSON-specific attributes to the
>> spec where possible.  Maybe we could consider implementing Avro JSON
>> "variants" by implementing encoder options, or alternative encoders
>> for an SDK. There's probably a nice balance between a rigorous and
>> interoperable (but less customizable) JSON encoding, and trying to
>> accommodate arbitrary JSON in the Avro project.
>> 
>> All my best and thanks for this analysis -- I'm excited to see where
>> this leads!  Ryan
>> 
>> 
>> 
>> On Thu, Apr 18, 2024 at 8:01 PM Oscar Westra van Holthe - Kind
>> <os...@westravanholthe.nl> wrote:
>> >
>> > Thank you Clemens,
>> >
>> > This is a very detailed set of proposals, and it looks like it would work.
>> >
>> > I do, however, feel we'd need to define a way to handle unions with 
>> > records. Your proposal lists various options, of which the discriminator 
>> > option seems most portable to me.
>> >
>> > You mention the "displayName" proposal. I don't like that, as it mixes 
>> > data with UI elements. The discriminator option can specify a fixed or 
>> > configurable field to hold the type of the record.
>> >
>> > Kind regards,
>> > Oscar
>> >
>> >
>> > --
>> > Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
>> >
>> > On Thu, 18 Apr 2024 at 10:12, Clemens Vasters via user 
>> > <user@avro.apache.org> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >>
>> >>
>> >> the current JSON Encoding approach severely limits interoperability with 
>> >> other JSON serialization frameworks. In my view, the JSON Encoding is 
>> >> only really useful if it acts as a bridge into and from JSON-centric 
>> >> applications and it currently gets in its own way.
>> >>
>> >>
>> >>
>> >> The current encoding being what it is, there should be an alternate mode 
>> >> that emphasizes interoperability with JSON “as-is” and allows Avro Schema 
>> >> to describe existing JSON document instances such that I can take 
>> >> someone’s existing JSON document in on one side of a piece of software 
>> >> and emit Avro binary on the other side while acting on the same schema.
>> >>
>> >>
>> >>
>> >> There are four specific issues:
>> >>
>> >>
>> >>
>> >> 1. Binary Values
>> >> 2. Unions with Primitive Type Values and Enum Values
>> >> 3. Unions with Record Values
>> >> 4. DateTime
>> >>
>> >>
>> >>
>> >> One by one:
>> >>
>> >>
>> >>
>> >> 1. Binary values:
>> >>
>> >> ---------------------
>> >>
>> >>
>> >>
>> >> Binary values (fixed and bytes) are encoded as escaped unicode 
>> >> literals. While I appreciate the creative trick, it costs 6 bytes for 
>> >> each encoded byte. I have a hard time finding any JSON libraries that 
>> >> provide a conversion of such strings from/to byte arrays, so this 
>> >> approach appears to be idiosyncratic for Avro’s JSON Encoding.
>> >>
>> >>
>> >>
>> >> The common way to encode binary in JSON is to use base64 encoding and 
>> >> that is widely and well supported in libraries. Base64 is 33% larger than 
>> >> plain bytes, the encoding chosen here is 500% (!) larger than plain bytes.
>> >>
>> >>
>> >>
>> >> The Avro decoder is schema-informed and it knows that a field is expected 
>> >> to hold bytes, so it’s easy to mandate base64 for the field content in 
>> >> the alternate mode.
>> >>
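To make the size comparison concrete, here is a small sketch. It assumes the worst case described above, where every byte is written as a \u00XX escape; an actual encoder may leave some printable characters unescaped.

```python
import base64
import json

payload = bytes(range(256))

# Escaped style: every byte becomes a six-character \u00XX escape.
escaped = "".join(f"\\u{b:04x}" for b in payload)

# Alternate style: base64 text.
b64 = base64.b64encode(payload).decode("ascii")

# Both forms survive a JSON round trip back to the original bytes.
assert json.loads(f'"{escaped}"').encode("latin-1") == payload
assert base64.b64decode(json.loads(json.dumps(b64))) == payload

print(len(payload), len(escaped), len(b64))  # prints: 256 1536 344
```

That is 6 characters per byte for the escaped form against roughly 4 characters per 3 bytes for base64.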
>> >>
>> >>
>> >> 2. Unions with Primitive Type Values and Enum Values
>> >>
>> >> ---------------------
>> >>
>> >>
>> >>
>> >> It’s common to express optionality in Avro Schema by creating a union 
>> >> with the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding 
>> >> opts to encode such unions, like any union, as { “{type}”: {value} } when 
>> >> the value is non-null.
>> >>
>> >>
>> >>
>> >> This choice ignores common practice and the fact that JSON’s values are 
>> >> dynamically typed (RFC8259 Section-3) and inherently accommodate unions. 
>> >> The conformant way to encode a value choice of null or “string” into a 
>> >> JSON value is plainly null and “string”.
>> >>
>> >>
>> >>
>> >> “foo” : null
>> >>
>> >> “foo”: “value”
>> >>
>> >>
>> >>
>> >> The “field default values” table in the Avro spec maps Avro types to the 
>> >> JSON types null, boolean, integer, number, string, object, and array, all 
>> >> of which can be encoded into and, more importantly, unambiguously decoded 
>> >> from a JSON value. The only semi-ambiguous case is integer vs. number, 
>> >> which is a convention in JSON rather than a distinct type, but any Avro 
>> >> serializer is guided by type information and can easily make that 
>> >> distinction.
>> >>
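A rough sketch of the difference, limited to unions of primitive types. The function names are hypothetical and not taken from any Avro SDK; the decoder picks the first union branch whose Avro type is compatible with the JSON type of the value.

```python
import json


def encode_current(branch, value):
    """Current Avro JSON encoding: non-null union values are wrapped as {branch: value}."""
    return json.dumps(None if value is None else {branch: value})


def encode_plain(value):
    """Alternate mode: the value is written as a plain JSON value."""
    return json.dumps(value)


def _matches(branch, value):
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    is_bool = isinstance(value, bool)
    if branch == "null":
        return value is None
    if branch == "boolean":
        return is_bool
    if branch in ("int", "long"):
        return isinstance(value, int) and not is_bool
    if branch in ("float", "double"):
        return isinstance(value, (int, float)) and not is_bool
    if branch == "string":
        return isinstance(value, str)
    return False


def decode_plain(text, branches):
    """Return (branch, value) for the first branch matching the JSON value."""
    value = json.loads(text)
    for branch in branches:
        if _matches(branch, value):
            return branch, value
    raise ValueError(f"value does not match any of {branches}")
```

With the union ["string", "null"], `encode_current("string", "value")` gives `{"string": "value"}`, while `encode_plain("value")` gives `"value"`, and `decode_plain` recovers the branch from the schema alone.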
>> >>
>> >>
>> >> 3. Unions with Record Values
>> >>
>> >> ---------------------
>> >>
>> >>
>> >>
>> >> The JSON Encoding pattern of unions also covers “record” typed values, of 
>> >> course, and this is indeed a tricky scenario during deserialization since 
>> >> JSON does not have any built-in notion of type hints for “object” typed 
>> >> values.
>> >>
>> >>
>> >>
>> >> The problem of having to disambiguate instances of different types in a 
>> >> field value is a common one also for users of JSON Schema when using the 
>> >> “oneOf” construct, which is equivalent to Avro unions. There are two 
>> >> common strategies:
>> >>
>> >>
>> >>
>> >> - “Duck Typing”:  Every conformant JSON Schema Validator determines the 
>> >> validity of a JSON node against a “oneOf" rule by testing the instance 
>> >> against all available alternative schema definitions. Validation fails if 
>> >> there is not exactly one valid match.
>> >>
>> >> - Discriminators: OpenAPI, for instance, mandates a “discriminator” field 
>> >> (see https://spec.openapis.org/oas/latest.html#discriminator-object) for 
>> >> disambiguating “oneOf” constructs, whereby the discriminator property is 
>> >> part of each instance. That approach informs numerous JSON serialization 
>> >> frameworks, which implement discriminators under that assumption.
>> >>
>> >>
>> >>
>> >> The Java Jackson library indeed supports the Avro JSON Encoding’s style 
>> >> of putting the discriminator into a wrapper field name (JsonTypeInfo 
>> >> annotation, JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only 
>> >> support the property approach, though, including the two dominant ones 
>> >> for .NET, Pydantic for Python, and others. There’s tooling like Redocly 
>> >> that flags that approach as a “mistake” (see 
>> >> https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object).
>> >>
>> >>
>> >>
>> >> What that means is that most existing JSON instances with ambiguous types 
>> >> will either use property discriminators or the implementation will rely 
>> >> on duck typing as JSON Schema does for validation. The Avro JSON Encoding 
>> >> approach is rare and is also counterintuitive for anyone comparing the 
>> >> declared object structure and the JSON structure who is not familiar with 
>> >> Avro’s encoding rules. It has confused a lot of people in our house, for 
>> >> sure.
>> >>
>> >>
>> >>
>> >> Proposed is the following approach:
>> >>
>> >>
>> >>
>> >> a) add a new, optional “const” attribute that can be applied to any 
>> >> record field declaration that is of a primitive type. When present, the 
>> >> attribute causes the field to always have this value. In Avro binary 
>> >> encoding, the field is not transmitted, at all, but the decoder yields it 
>> >> with the given value. In Avro JSON encoding, the field is emitted and for 
>> >> serialization to succeed for the record type, the field must be present 
>> >> with the given value.
>> >>
>> >> b) perform disambiguation of types by the same principle as JSON Schema 
>> >> for oneOf, with a performance preference for matching fields flagged with 
>> >> “const” against the incoming JSON node. When the deserializer is 
>> >> configured by schema to know what fields and values to look for, there 
>> >> should be no performance hit compared to the current approach.  
>> >> Deserialization fails if there is not one unambiguous match. That is 
>> >> exactly in line with what JSON Schema validation implementations do. JSON 
>> >> Schema also has a “const” construct. “Const” or single-valued enums are 
>> >> often used as discriminator helpers with JSON Schema’s oneOf.
>> >>
>> >> c) optional: add a new, optional “displayname” attribute that can hold an 
>> >> alternate name for the field without the restrictions of the “name” 
>> >> character set, so that discriminators like “$type” can be matched. A 
>> >> further upside of adding this field is that it can generally be used to 
>> >> match international characters in JSON object keys, which are obviously 
>> >> permitted there.
>> >>
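A minimal sketch of how (a) and (b) could fit together. Note that "const" is the attribute proposed above and does not exist in the current Avro specification, and that `matches` and `resolve_union` are hypothetical names. The check only looks at field presence and const values, not at the types of the other fields.

```python
def matches(record_schema, node):
    """Duck-typing check of a JSON object against one record schema."""
    if not isinstance(node, dict):
        return False
    names = set()
    for field in record_schema["fields"]:
        names.add(field["name"])
        if "const" in field:
            # Proposed attribute: the field must be present with exactly this value.
            if node.get(field["name"]) != field["const"]:
                return False
        elif field["name"] not in node and "default" not in field:
            return False
    # Reject objects that carry keys the record does not declare.
    return set(node) <= names


def resolve_union(record_schemas, node):
    """Return the name of the single matching record type, as oneOf would."""
    candidates = [s for s in record_schemas if matches(s, node)]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one matching record type, found {len(candidates)}"
        )
    return candidates[0]["name"]


CAT = {"type": "record", "name": "Cat", "fields": [
    {"name": "kind", "type": "string", "const": "cat"},
    {"name": "name", "type": "string"},
]}
DOG = {"type": "record", "name": "Dog", "fields": [
    {"name": "kind", "type": "string", "const": "dog"},
    {"name": "name", "type": "string"},
]}
```

Here `resolve_union([CAT, DOG], {"kind": "dog", "name": "Rex"})` selects Dog, while an object without a "kind" property matches neither record and is rejected.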
>> >>
>> >>
>> >> 4. Date Time
>> >>
>> >> ---------------------
>> >>
>> >>
>> >>
>> >> JSON data generally leans on the RFC3339 profile of ISO8601 for dates and 
>> >> durations, not least because JSON Schema defines these choices as 
>> >> “format” variants for strings.
>> >>
>> >>
>> >>
>> >> If the incoming type of a field is a string instead of a number, JSON 
>> >> deserialization in the alternate mode should interpret the logicalTypes 
>> >> for dates as follows.
>> >>
>> >>
>> >>
>> >> “date” – RFC3339 5.6 “full-date”
>> >> “time-millis” – RFC3339 5.6 “partial-time”
>> >> “time-micros” – RFC3339 5.6 “partial-time”
>> >> “timestamp-millis” – RFC3339 5.6 “date-time”
>> >> “timestamp-micros” – RFC3339 5.6 “date-time”
>> >> “local-timestamp-millis” – RFC3339 5.6 “date-time”, ignoring offset (but 
>> >> see RFC 3339 4.4)
>> >> “local-timestamp-micros” – RFC3339 5.6 “date-time”, ignoring offset (but 
>> >> see RFC 3339 4.4)
>> >> “duration” – RFC3339 Appendix A “duration”
>> >>
>> >>
>> >>
>> >> The JSON serialization in the alternate mode should have an option, and 
>> >> default to, serializing dates as strings. Deserialization parsers MAY be 
>> >> lenient and also accept RFC1123 5.2.13 date time strings where RFC3339 
>> >> 5.6 “date-time” is specified, but I’d make that an implementation choice.
>> >>
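As an illustration, a sketch of the string-to-logical-type direction for three of these mappings, using only the Python standard library. The function names are made up for this example, and the parsing relies on `fromisoformat`, which accepts a somewhat different set of strings than RFC 3339 strictly allows.

```python
from datetime import date, datetime, time, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_date(text):
    """RFC 3339 full-date -> Avro "date" (days since 1970-01-01)."""
    return (date.fromisoformat(text) - EPOCH_DATE).days


def parse_time_micros(text):
    """RFC 3339 partial-time -> Avro "time-micros" (microseconds since midnight)."""
    t = time.fromisoformat(text)
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


def parse_timestamp_millis(text):
    """RFC 3339 date-time -> Avro "timestamp-millis" (milliseconds since the epoch, UTC)."""
    # Older Python versions reject a trailing "Z", so spell it as a numeric offset.
    if text[-1] in "Zz":
        text = text[:-1] + "+00:00"
    moment = datetime.fromisoformat(text)
    # Raises TypeError if the string has no offset, which RFC 3339 date-time requires.
    return (moment - EPOCH) // timedelta(milliseconds=1)
```

For example, `parse_timestamp_millis("1970-01-01T01:00:00+01:00")` is 0, because the offset is applied before converting to milliseconds since the epoch.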
>> >>
>> >>
>> >>
>> >>
>> >> Best Regards
>> >>
>> >> Clemens Vasters
>> >>
>> >>
> 
