>> What causes the schema normalization to be incomplete?
A bad implementation: I use the C++ Avro library, and its normalization is
incomplete and not very actively maintained.
> And is that a problem? As long as the reader can get the schema, it
> shouldn't matter that there are duplicates – as long as the
> differences between the duplicates do not affect decoding.
Not really a problem for us; we tend to use machine-generated schemas, and
those are always identical.
If I remember correctly, there are holes in the simplification of types:
namespaces should be collapsed, {"type" : "string"} should be reduced to
"string", and so on.
The current implementation can't reliably decide whether two types are
identical. If you correct the problem later, a registered schema would
actually change its hash, since it can now be simplified further. Whether
that is a problem depends on your application.
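To make the hash concern concrete, here is a minimal sketch (not the real
Avro canonicalizer; the normalize() helper and the use of std::hash as a
stand-in fingerprint are illustrative assumptions only) showing how two
spellings of the same type only hash the same once the simplification is
applied:

#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for a real canonicalizer. A complete one would parse
// the JSON, collapse namespaces, strip irrelevant attributes, etc. Here we
// only apply the single {"type" : "string"} -> "string" collapse discussed
// above.
std::string normalize(const std::string& schema) {
    if (schema == R"({"type" : "string"})")
        return R"("string")";
    return schema;
}

int main() {
    const std::string a = R"("string")";
    const std::string b = R"({"type" : "string"})";  // same type, other spelling

    std::hash<std::string> fingerprint;  // stand-in for the real hash (e.g. MD5)

    // Without normalization the equivalent schemas get different fingerprints,
    // so fixing the normalizer later silently changes already-registered hashes.
    std::cout << "raw:        " << fingerprint(a) << " vs " << fingerprint(b) << "\n";
    std::cout << "normalized: " << fingerprint(normalize(a)) << " vs "
              << fingerprint(normalize(b)) << "\n";
}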
We currently encode this as you suggest: <schema_type (byte)><schema_id
(32/128 bits)><avro (binary)>.
The binary fields should probably have a defined endianness as well.
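A minimal sketch of that framing, assuming the 32-bit schema_id variant and
big-endian (network) byte order for the id (the function name and signature
are just for illustration):

#include <cstdint>
#include <vector>

// Frame a message as <schema_type (byte)><schema_id (32 bits, big-endian)><avro (binary)>.
// A 128-bit id variant would serialize the fingerprint bytes in a fixed order
// in the same way.
std::vector<uint8_t> frame(uint8_t schema_type,
                           uint32_t schema_id,
                           const std::vector<uint8_t>& avro_payload) {
    std::vector<uint8_t> out;
    out.reserve(1 + 4 + avro_payload.size());
    out.push_back(schema_type);
    out.push_back(static_cast<uint8_t>(schema_id >> 24));  // explicit big-endian,
    out.push_back(static_cast<uint8_t>(schema_id >> 16));  // so the layout is
    out.push_back(static_cast<uint8_t>(schema_id >> 8));   // unambiguous across
    out.push_back(static_cast<uint8_t>(schema_id));        // hosts
    out.insert(out.end(), avro_payload.begin(), avro_payload.end());
    return out;
}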
I agree that a de facto way of encoding this would be nice. Currently I
would say the Confluent / LinkedIn way is the norm...
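(For reference, the Confluent wire format is essentially the same layout: a
single zero magic byte, a 4-byte big-endian schema id, and then the Avro
binary body.)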