Avro JSON Encoding

Clemens Vasters via user Thu, 18 Apr 2024 01:13:36 -0700

Hi everyone,

the current JSON Encoding approach severely limits interoperability with other 
JSON serialization frameworks. In my view, the JSON Encoding is only really 
useful if it acts as a bridge into and from JSON-centric applications and it 
currently gets in its own way.


The current encoding being what it is, there should be an alternate mode that 
emphasizes interoperability with JSON "as-is" and allows Avro Schema to 
describe existing JSON document instances such that I can take someone's 
existing JSON document in on one side of a piece of software and emit Avro 
binary on the other side while acting on the same schema.

There are four specific issues:


  1.  Binary Values
  2.  Unions with Primitive Type Values and Enum Values
  3.  Unions with Record Values
  4.  DateTime

One by one:

1. Binary values:
---------------------

Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
While I appreciate the creative trick, it costs 6 bytes for each encoded byte. 
I have a hard time finding any JSON libraries that provide a conversion of such 
strings from/to byte arrays, so this approach appears to be idiosyncratic for 
Avro's JSON Encoding.

The common way to encode binary in JSON is to use base64 encoding and that is 
widely and well supported in libraries. Base64 is 33% larger than plain bytes; 
the encoding chosen here is 500% (!) larger than plain bytes.

The Avro decoder is schema-informed and it knows that a field is expected to 
hold bytes, so it's easy to mandate base64 for the field content in the 
alternate mode.
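
To make the size difference concrete, here is a rough Python sketch. It is not Avro library code; the helper names avro_json_bytes and base64_json_bytes are made up for illustration, and the first one assumes every byte is written as a \u00XX escape, as described above.

```python
import base64
import json


def avro_json_bytes(data: bytes) -> str:
    """Hypothetical helper: one \\u00XX escape per byte, as described above."""
    return '"' + "".join(f"\\u{b:04x}" for b in data) + '"'


def base64_json_bytes(data: bytes) -> str:
    """Hypothetical helper: the same bytes as a base64 JSON string."""
    return json.dumps(base64.b64encode(data).decode("ascii"))


payload = bytes([0x00, 0xFF, 0x10])
escaped = avro_json_bytes(payload)    # "\u0000\u00ff\u0010"
b64 = base64_json_bytes(payload)      # "AP8Q"

print(len(escaped), len(b64))         # 20 vs. 6 characters, quotes included

# Both forms decode back to the same bytes.
assert json.loads(escaped).encode("latin-1") == payload
assert base64.b64decode(json.loads(b64)) == payload
```

Because the decoder is schema-informed, it can pick the base64 path purely from the declared field type, without inspecting the string.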

2. Unions with Primitive Type Values and Enum Values
---------------------

It's common to express optionality in Avro Schema by creating a union with the 
"null" type, e.g. ["string", "null"]. The Avro JSON Encoding opts to encode 
such unions, like any union, as { "{type}": {value} } when the value is 
non-null.

This choice ignores common practice and the fact that JSON's values are 
dynamically typed (RFC 8259, Section 3: 
https://www.rfc-editor.org/rfc/rfc8259#section-3) and inherently 
accommodate unions. The conformant way to encode a value choice of null or 
"string" into a JSON value is plainly null and "string".

"foo" : null
"foo": "value"

The "field default values" table in the Avro spec maps Avro types to the JSON 
types null, boolean, integer, number, string, object, and array, all of which 
can be encoded into and, more importantly, unambiguously decoded from a JSON 
value. The only semi-ambiguous case is integer vs. number, which is a 
convention in JSON rather than a distinct type, but any Avro serializer is 
guided by type information and can easily make that distinction.
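
As a sketch of how a schema-guided decoder could resolve such a union from a plain JSON value: the function names below are hypothetical, only primitive branches are handled, and taking the first matching branch in declared order is my assumption, not something the spec says.

```python
import json


def matches(avro_type: str, value) -> bool:
    """Return True if a parsed JSON value fits the given Avro primitive type."""
    if avro_type == "null":
        return value is None
    if avro_type == "boolean":
        return isinstance(value, bool)
    if avro_type in ("int", "long"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if avro_type in ("float", "double"):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if avro_type == "string":
        return isinstance(value, str)
    raise ValueError(f"not a primitive type handled by this sketch: {avro_type}")


def resolve_union(branches, value) -> str:
    """Pick the first union branch (in declared order) that fits the value."""
    for branch in branches:
        if matches(branch, value):
            return branch
    raise ValueError(f"value {value!r} matches no branch of {branches}")


doc = json.loads('{"foo": "value", "bar": null, "n": 5, "x": 5.5}')
print(resolve_union(["string", "null"], doc["foo"]))  # string
print(resolve_union(["string", "null"], doc["bar"]))  # null
print(resolve_union(["long", "double"], doc["n"]))    # long
print(resolve_union(["long", "double"], doc["x"]))    # double
```

The integer vs. number case is settled by the schema: a JSON 5 fits "long" first if that branch is declared first, and a JSON 5.5 can only fit "double".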

3. Unions with Record Values
---------------------

The JSON Encoding pattern of unions also covers "record" typed values, of 
course, and this is indeed a tricky scenario during deserialization since JSON 
does not have any built-in notion of type hints for "object" typed values.

The problem of having to disambiguate instances of different types in a field 
value is a common one also for users of JSON Schema when using the "oneOf" 
construct, which is equivalent to Avro unions. There are two common strategies:

- "Duck Typing":  Every conformant JSON Schema Validator determines the 
validity of a JSON node against a "oneOf" rule by testing the instance against 
all available alternative schema definitions. Validation fails if there is not 
exactly one valid match.
- Discriminators: OpenAPI, for instance, mandates a "discriminator" field (see 
https://spec.openapis.org/oas/latest.html#discriminator-object) for 
disambiguating "oneOf" constructs, whereby the discriminator property is part 
of each instance. That approach informs numerous JSON serialization frameworks, 
which implement discriminators under that assumption.

The Java Jackson library indeed supports the Avro JSON Encoding's style of 
putting the discriminator into a wrapper field name (JsonTypeInfo annotation, 
JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only support the 
property approach, though, including the two dominant ones for .NET, Pydantic 
for Python, and others. There's tooling like Redocly that flags that approach as 
a "mistake" (see 
https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object).

What that means is that most existing JSON instances with ambiguous types will 
either use property discriminators or the implementation will rely on duck 
typing as JSON Schema does for validation. The Avro JSON Encoding approach is 
rare and is also counterintuitive for anyone comparing the declared object 
structure and the JSON structure who is not familiar with Avro's encoding 
rules. It has confused a lot of people in our house, for sure.

Proposed is the following approach:

a) add a new, optional "const" attribute that can be applied to any record 
field declaration that is of a primitive type. When present, the attribute 
causes the field to always have this value. In Avro binary encoding, the field 
is not transmitted, at all, but the decoder yields it with the given value. In 
Avro JSON encoding, the field is emitted and for serialization to succeed for 
the record type, the field must be present with the given value.
b) perform disambiguation of types by the same principle as JSON Schema for 
oneOf, with a performance preference for matching fields flagged with "const" 
against the incoming JSON node. When the deserializer is configured by schema 
to know what fields and values to look for, there should be no performance 
hit compared to the current approach.  Deserialization fails if there is not one 
unambiguous match. That is exactly in line with what JSON Schema validation 
implementations do. JSON Schema also has a "const" construct. "Const" or 
single-valued enums are often used as discriminator helpers with JSON Schema's 
oneOf.

c) optional: add a new, optional "displayname" attribute that can hold an 
alternate name for the field without the restrictions of the "name" character 
set, so that discriminators like "$type" can be matched. A further upside of 
adding this field is that it can generally be used to match international 
characters in JSON object keys, which are obviously permitted there.
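
A minimal Python sketch of (a) and (b), to show what I have in mind. The "const" attribute is the proposed one and does not exist in Avro Schema today; the Cat/Dog schemas and the select_record function are invented for the example, and the duck-typing step is simplified to comparing key sets (no defaults, no nested validation).

```python
# Two hypothetical record branches of a union, each with a "const" field.
BRANCHES = [
    {"type": "record", "name": "Cat", "fields": [
        {"name": "kind", "type": "string", "const": "cat"},
        {"name": "lives", "type": "int"},
    ]},
    {"type": "record", "name": "Dog", "fields": [
        {"name": "kind", "type": "string", "const": "dog"},
        {"name": "breed", "type": "string"},
    ]},
]


def select_record(branches, node: dict) -> dict:
    """Return the single record schema that the JSON object matches."""
    candidates = []
    for schema in branches:
        fields = schema["fields"]
        # Cheap check first: every "const" field must be present with its value.
        if any("const" in f and node.get(f["name"]) != f["const"] for f in fields):
            continue
        # Simplified duck typing: the object must carry exactly the declared keys.
        if set(node) == {f["name"] for f in fields}:
            candidates.append(schema)
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one match, found {len(candidates)}")
    return candidates[0]


print(select_record(BRANCHES, {"kind": "dog", "breed": "corgi"})["name"])  # Dog
```

With "const" fields present, most branches are rejected after a single key lookup, which is why I don't expect a meaningful cost over the wrapper-object approach.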

4. Date Time
---------------------

JSON data generally leans on the RFC3339 profile of ISO8601 for dates and 
durations, not least because JSON Schema defines these choices as "format" 
variants for strings.

If the incoming type of a field is a string instead of a number, JSON 
deserialization in the alternate mode should interpret the logicalTypes for 
dates as follows.


  *   "date" - RFC3339 5.6 "full-date"
  *   "time-millis" - RFC3339 5.6 "partial-time"
  *   "time-micros" - RFC3339 5.6 "partial-time"
  *   "timestamp-millis" - RFC3339 5.6 "date-time"
  *   "timestamp-micros" - RFC3339 5.6 "date-time"
  *   "local-timestamp-millis" - RFC3339 5.6 "date-time", ignoring offset (but 
see RFC 3339 4.4)
  *   "local-timestamp-micros" - RFC3339 5.6 "date-time", ignoring offset (but 
see RFC 3339 4.4)
  *   "duration" - RFC3339 Appendix A "duration"

JSON serialization in the alternate mode should have an option for, and default 
to, serializing dates as strings. Deserialization parsers MAY be lenient and 
also accept RFC1123 5.2.14 date-time strings where RFC3339 5.6 "date-time" is 
specified, but I'd make that an implementation choice.
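
For illustration, a rough Python sketch of the string-to-number direction for three of the logical types. The function names are hypothetical, and Python's fromisoformat only covers part of RFC3339 (older versions reject a trailing "Z" and some fraction lengths), so treat this as a sketch under those assumptions rather than a conformant parser.

```python
from datetime import date, datetime, time, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_date(text: str) -> int:
    """RFC3339 "full-date" -> Avro "date" (days since 1970-01-01)."""
    return (date.fromisoformat(text) - EPOCH_DATE).days


def parse_time_micros(text: str) -> int:
    """RFC3339 "partial-time" -> Avro "time-micros" (microseconds after midnight)."""
    t = time.fromisoformat(text)
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


def parse_timestamp_millis(text: str) -> int:
    """RFC3339 "date-time" -> Avro "timestamp-millis" (ms since the UTC epoch)."""
    if text.endswith(("Z", "z")):
        # Older Python versions do not accept "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        raise ValueError("RFC3339 date-time requires an offset")
    delta = dt - EPOCH
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


print(parse_date("1970-01-02"))                            # 1
print(parse_time_micros("00:00:01.500000"))                # 1500000
print(parse_timestamp_millis("1970-01-01T00:00:01.250Z"))  # 1250
```

The decoder would only take this path when the schema declares the logical type and the incoming JSON value is a string; numeric values would keep their current meaning.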


Best Regards
Clemens Vasters
