Re: Gigantic list of Avro spec issues

Ryan Skraba Mon, 14 Feb 2022 09:18:42 -0800

Hello!

Thanks, Dan, for your calm and measured response -- you've given some
excellent advice on how someone can make a positive contribution to
the project and the spec.


Askar, your approach in presenting your specification review should
have been more constructive: it isn't useful to cross-post across
lists and recipients indiscriminately, and please remain civil when
pointing out (potential) errors.  I understand that the Avro
specification isn't perfect, but I think the community has always been
open to making changes and providing clarifications.

I'm glad that you've found some things to appreciate in the Avro
project, and I hope you choose to contribute in making the
improvements that you'd like to see.

Ryan

[1] https://www.apache.org/foundation/policies/conduct

On Sun, Feb 13, 2022 at 6:05 PM Dan Schmitt <dan.schm...@gmail.com> wrote:
>
> I will admit the spec is likely weak around unicode/utf encoding,
> (as the serialization of strings to json isn't consistent across all
> language bindings) but I file the ticket against implementations
> and write test cases rather than make guesses at the spec
> wording and make demands without knowing it's a real issue.
>
> On Sat, Feb 12, 2022 at 6:55 PM Dan Schmitt <dan.schm...@gmail.com> wrote:
> >
> > Generally the "I'm going to lump all my complaints into
> > one big bug" is a good way to get them ignored.
> >
> > I'll skip "the design is wrong and it should change because
> > I don't like it" and cite "it's used everywhere with lots of
> > implementations so you can't change it in an incompatible way".
> >
> > I'll skip obvious typos and suggest you catch more flies with
> > honey than vinegar, and you can work a git repository and
> > make a pull request for those and they'd be fixed fast.  (Or
> > read 2 or 3 of the 8 implementations if the phrasing is confusing
> > to you.)
> >
> > I'll make some suggestions on the technical details
> > So, specifically:
> >
> > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker
> > for this file". I don't like this point. It
> > implies that container files are usually not equal. Thus it is not
> > possible to compare them bitwise to determine
> > equality. So, in my Avro implementation I write null bytes instead of
> > this marker (yes, this possibly means that my
> > implementation is non-conforming)
> > * [Opinion] [Line 717]. There is no any marker for end of container
> > file. Thus there is no way to determine whether all
> > data was written
> >
> > If you use the sync marker, you don't need an end of container marker.
> > (Flush the sync and container block map with new data after the new
> > block is written, if you have the metadata and block list you know that
> > much is complete and written for you to read, if you read the metadata
> > and your sync byte marker is wrong, go re-read/continue read.)
> >
> > * [Very theoretical bug, possible even security-related] [Line 435].
> > Since you have a test case that proves it doesn't crash systems, it's
> > sort of not a bug right?  You could at the test case to the test suite.
> >
> > * [Bug] [Line 572]. "Currently for C/C++ implementations, the
> > positions are practically an int, but theoretically a
> > long". Wat? So, other implementations use int (as per spec), but C++
> > uses long, right? So, go fix C++ implementation to
> > match spec and other implementations
> >
> > This is not a bug, but an acknowledgement that the C/C++ offsets
> > are internally implemented via pointer math to be efficient but if
> > you try to read in enough data that a long offset makes sense,
> > you will be sad/run out of memory.   That the internal implementation
> > for C/C++ supports the minimum required by the specification.
> >
> > * [Bug] [Line 385]. "For example, int and long are always serialized
> > the same way". What this means? You probably mean
> > that *same* int and long (i. e. int and long, which are numerically
> > identical) serialized the same way.
> >
> > That rewrite is wrong.  Your wording would allow serialization to be
> > altered by value (e.g. it would be allowable to use big endian storage
> > for odd numbers and little endian for even as each same int and long
> > would be serialized the same way.)
> >
> > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a
> > JSON null". There is no reasons to special-case
> > nulls. This is additional requirement, which adds complexity to
> > implementations without any reasons
> >
> > You are making assumptions about implementation encoding of null.
> > A C developer would say writing 0x00 to the file that you will read back
> > later is fine for null or false or 0.
> >
> > * [Bug] [Line 417]. - can't take action, the request breaks compatibility.
> >
> > * [Bug] [Line 292]. "The null namespace may not be used in a
> > dot-separated sequence of names". You defined previously
> > null namespace as a empty string instead of *whole* namespace. I. e.
> > null namespace is lack of namespace (i. e. lack of
> > whole dot-separated sequence). So there is no sense in speaking on
> > using null namespace in dot-separated sequence. You
> > probably mean that one should not use empty string in dot-separated sequence
> >
> > That doesn't read as a bug at all to me, you have introduce a "whole
> > namespace" and only then does it not make sense
> >
> > a.b.c - a is the top level namespace, b is a namespace in a, c is a
> > namespace in b.
> >
> > Probably easier to read up on C++ namespaces and nesting and full
> > specification to see how your introduction of "whole namespace" where
> > it doesn't exist is what is causing your confusion.
> >
> > I look forward to more bug reports and pull requests to iron out typos
> > and standardize existing practice (you might want to read through
> > other implementations to make sure your suggestions are useful to
> > the community at large.)
> >
> > On Sat, Feb 12, 2022 at 5:22 PM Askar Safin <safinas...@mail.ru> wrote:
> > >
> > > Hi. I'm writing my own Avro implementation in Rust for personal use. 
> > > During this work I found a lot of issues in Avro
> > > spec (the list follows).
> > >
> > > I send this mail not only to user and dev mailing lists of Avro, but also 
> > > to Apache community list, Kafka list and 3
> > > semi-randomly chosen Materialize employees. Because I want to draw 
> > > attention to this problems. I hope this wider
> > > community helps Avro fix their issues and possible will give necessary 
> > > resources.
> > >
> > > As well as I understand Avro is used in Kafka. And Kafka, according to 
> > > their site, is used in "80% of all Fortune 100
> > > companies". So Avro is critical piece of infrastructure of humanity. It 
> > > should be absolutely perfect (and so I list
> > > even very small bugs). But it is not perfect.
> > >
> > > Some of items in this list are (small and big) bugs, some are typos, some 
> > > are my objections to the design. Some are
> > > fixable while keeping compatibility, some are not. I don't want to spend 
> > > my time to report them as separate bugs, but
> > > you can try to convince me to do so.
> > >
> > > I created this list simply by reading the spec from end to end (I skipped 
> > > sections on RPC and logical types). And I
> > > even didn't look at implementations!
> > >
> > > I write this is hope to help Avro.
> > >
> > > I think big audit of spec and its implementations should be done.
> > >
> > > All line numbers apply to spec.xml from tag release-1.11.0 (i. e.
> > > https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml
> > >  ). As well as I understand this tag
> > > corresponds to currently published version at 
> > > https://avro.apache.org/docs/current/spec.html .
> > >
> > > So, here we go.
> > >
> > > * [Opinion] [No line]. In Avro one have to define named records inside 
> > > each other like so:
> > >
> > > { "type": "record", "name": "a", "fields": 
> > > [{"name":"b","type":{"type":"record","name":"c",...}}] }
> > >
> > > This is very unnatural. In popular programming languages one usually 
> > > define named record next to each other, not one
> > > inside other. Such representation is not handy to deal programmatically. 
> > > In my implementation I have to convert this
> > > representation to usual form "root type + list of named types" right 
> > > after reading JSON and convert back just before
> > > writing.
> > >
> > > * [Opinion] [No line]. In this list you will see a lot of questions on 
> > > Avro schema (encoded as JSON). Good JSON schema
> > > ( https://json-schema.org/ ) would resolve many of them
> > >
> > > * [Seems to be bug] [Line 49]. "derived type name" is vague term. In 
> > > fact, in whole spec phrase "type name" is used
> > > very vaguely. Sometimes it means strings like "record" and sometimes it 
> > > means names of named types. I propose to define
> > > in very beginning of the spec terms for primitive types, terms for 
> > > strings like "record" and terms for names of defined
> > > types. Here is one possible way to do this: name strings like "record" 
> > > and "fixed" "type kinds" and never name them
> > > type names, thus reserving term "type name" to named types only (and 
> > > possibly to primitive types).
> > >
> > > This issue already caused problems: look, for example, to this problems 
> > > with {"type":"record","name":"record",...}:
> > > https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .
> > >
> > > * [Opinion] [Line 58]. There is no primitive type for unsigned 64 bit 
> > > integers. Such type is present in languages such
> > > as C and Rust
> > >
> > > * [Very theoretical bug, possible even security-related] [Line 435]. "The 
> > > float is converted into a 32-bit integer
> > > using a method equivalent to Java's floatToIntBits and then encoded in 
> > > little-endian format". If we click at provided
> > > link, we will see that this Java function does NaN normalization. I think 
> > > NaN normalization is good thing. But I think
> > > this quite possible spec implementers overlooked this NaN normalization 
> > > requirement. So I propose: write explicitly
> > > directly in Avro spec that NaN are normalized. Audit all Avro 
> > > implementations: whether they actually implemented this
> > > requirement. Create tests, which will actually test this requirement.
> > >
> > > Also I don't know whether bit pattern provided in that Java doc 
> > > (0x7fc00000) is quiet NaN or signaling. If it is
> > > signaling, this is very bad.
> > >
> > > As well as I understand if you will configure your FPU particular way 
> > > than merely copying signaling NaN from one place
> > > to another will abort your program. So, if your FPU is configured certain 
> > > way then feeding particular binary Avro data
> > > to a program can crash it. I. e. this is security problem. So a reader 
> > > should be careful to check whether input data is
> > > signaling NaN *before* storing it in floating point registers.
> > >
> > > I checked whether manipulating signaling NaN can actually crash a program 
> > > in default settings in Windows and Linux. And
> > > it turned out that a program will not crash. Still I think signaling NaN 
> > > should be handled carefully.
> > >
> > > Write to spec that writers should normalize NaNs, that readers should 
> > > reject non-normalized NaNs and that readers
> > > should be careful not to store incoming floating number to floating-point 
> > > variable before its sanitizing. Write that
> > > this is security issue.
> > >
> > > * [Opinion] [Line 68]. "unicode character sequence". As well as I 
> > > understand Unicode character sequence means sequence
> > > of Unicode scalar values. Note that scalar value is not same thing as 
> > > code point. Unfortunately, some people don't know
> > > this, so please write explicitly: "this is sequence of scalar values, not 
> > > code points", to make sure implementations
> > > will be correct
> > >
> > > * [Bug] [Line 71]. "Primitive types have no specified attributes". This 
> > > is lie. At line 1527 you specify logical type
> > > based on primitive type int. Thus you specify particular meaning of 
> > > attribute "logicalType" for primitive type "int".
> > > Be careful at your wording. The spec should be rock-solid
> > >
> > > * [Opinion] [Line 96]. "aliases: a JSON array of strings, providing 
> > > alternate names for this record (optional)". Is
> > > empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may 
> > > say this is nitpicking, but I don't think so.
> > > Avro has important place in our infrastructure, so everything is 
> > > important. Think carefully whether empty list (and
> > > duplicates) is allowed everywhere in the spec where you see some kind of 
> > > list. I think empty arrays (and duplicates)
> > > should be disallowed in this particular case. Because the more things we 
> > > allow, the bigger is attack surface
> > >
> > > * [Bug] [Line 98]. "fields: a JSON array, listing fields". How many 
> > > fields allowed? Already reported by me at
> > > https://issues.apache.org/jira/browse/AVRO-3279
> > >
> > > * [Bug] [Line 235]. "Unions". How many variants in union allowed? Already 
> > > reported by me at
> > > https://issues.apache.org/jira/browse/AVRO-3280
> > >
> > > * [Bug] [Line 101]. "name: a JSON string providing the name of the field 
> > > (required), and". Word "and" usually placed
> > > immediately before last item in sequence. The text here looks like item 
> > > "doc" was last, but then the text was edited
> > > not carefully. This is very stupid typographic issue which shows that 
> > > authors are not careful about spec quality. Also,
> > > the spec is not consistent on placing dots after items (this applies to 
> > > whole spec). Sometimes I see nothing in the end
> > > of item, sometimes "." and sometimes ";"
> > >
> > > * [Opinion] [Line 106]. "default: A default value for..." What follows is 
> > > essentially description of JSON
> > > representation of Avro datum (except for unions). So, you managed to put 
> > > very important part of your spec directly into
> > > one paragraph into second level bullet point?!
> > >
> > > * [Opinion] [Line 112]. "Default values for union fields correspond to 
> > > the first schema in the union". This phrase is
> > > difference between JSON encoding for Avro data and JSON encoding for 
> > > default field. And, of course, presence of this
> > > difference is design bug
> > >
> > > * [Opinion] [Line 113]. "Default values for bytes and fixed fields are 
> > > JSON strings, where Unicode code points 0-255
> > > are mapped to unsigned 8-bit byte values 0-255". Wat? This is very 
> > > unnatural encoding. You misuse JSON string. They are
> > > for strings, they are not for binary data. You should use array of 
> > > numbers instead. I. e. encode bytes 0x0f 0x02 as
> > > [15, 2]. Moreover, how you will encode null bytes? "\u0000", right? C 
> > > programs have difficulties with such strings
> > >
> > > * [Bug] [Line 123]. Okey, so longs are encoded as JSON integers. But what 
> > > if given long is not a JSON-safe integer? As
> > > we know from https://datatracker.ietf.org/doc/html/rfc7159 integers 
> > > outside of range [-(2**53)+1, (2**53)-1] are not
> > > JSON-safe
> > >
> > > * [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON, 
> > > despite they seems to be allowed by Avro spec.
> > > So, JSON representation of Avro data is incomplete
> > >
> > > * [Bug] [Line 128]. "enum". What is enum? :) This term is not yet defined 
> > > by spec. For unknown reasons you decided to
> > > insert essentially whole description of JSON representation of Avro data 
> > > into one small paragraph even before type
> > > system is fully described. Please use terms only after their definition
> > >
> > > * [Stupid bug] [Line 168]. "namespace". Namespace is not marked as 
> > > optional or required. Do you ever read your spec?
> > >
> > > * [Even more stupid bug] [Line 168]. "<em>namespace</em>". Word 
> > > "namespace" is marked using <em>, not <code>, and we
> > > can see this in rendered version. This is very stupid typographic bug, 
> > > which is immediately obvious to anybody reading
> > > this document, even to non-technical people
> > >
> > > * [Bug] [Line 200]. "a single attribute". As we see in provided example, 
> > > "default" is allowed, too. What is meaning of
> > > this "default" attribute? And how its meaning differs from meaning of 
> > > "default" key in field description? (Same for
> > > maps)
> > >
> > > * [Bug] [Line 238]. "declares a schema which may be either a null or 
> > > string". Lie. Schema is ["null", "string"].
> > > *Value* may be a null or string. Please check that you don't confuse 
> > > types (schemas) with value through whole your spec
> > >
> > > * [Very opinionated opinion :)] [Line 235]. I don't like your unions at 
> > > all. I'm coming from languages like Haskell and
> > > Rust, where true sum types are supported. They are similar to your 
> > > unions, but their alternatives are named.
> > > Alternatives are identified by their names, so there is no restriction on 
> > > duplicate types. So there is no need for very
> > > unnatural restriction "Unions may not contain more than one schema with 
> > > the same type, except for the named types
> > > record, fixed and enum"
> > >
> > > * [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings, 
> > > providing alternate names for this enum". You
> > > mean "fixed", right? So, you copy-pasted section on enums? Do you ever 
> > > read your spec from end to end at least one time?
> > >
> > > * [Bug] [Line 265]. "size: an integer, specifying the number of bytes per 
> > > value". Is zero allowed?
> > >
> > > * [Bug] [Line 292]. "The null namespace may not be used in a 
> > > dot-separated sequence of names". You defined previously
> > > null namespace as a empty string instead of *whole* namespace. I. e. null 
> > > namespace is lack of namespace (i. e. lack of
> > > whole dot-separated sequence). So there is no sense in speaking on using 
> > > null namespace in dot-separated sequence. You
> > > probably mean that one should not use empty string in dot-separated 
> > > sequence
> > >
> > > * [Bug] [Line 374]. "Deserializing data into a newer schema is 
> > > accomplished by specifying an additional schema, the
> > > results of which are described in Schema Resolution". Term "additional 
> > > schema" is vague here. I would say so:
> > > "Deserializing data into a newer schema is accomplished by using an 
> > > algorithm described in Schema Resolution"
> > >
> > > * [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to 
> > > read Avro data with a schema that does not..."
> > > The whole paragraph is very vague. At first reading I thought that it is 
> > > about schema resolution. After several
> > > attempts to understand it I finally understood that the paragraph is 
> > > about reading attempts without original writer
> > > schema available at all. I propose removing whole paragraph or rewriting 
> > > it completely
> > >
> > > * [Bug] [Line 385]. "For example, int and long are always serialized the 
> > > same way". What this means? You probably mean
> > > that *same* int and long (i. e. int and long, which are numerically 
> > > identical) serialized the same way.
> > >
> > > * [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague. 
> > > Do you mean no bytes at all? Or null bytes,
> > > i. e. some undefined number of null bytes? (Of course, I understand that 
> > > you mean the first variant, but I still don't
> > > like the phrase)
> > >
> > > * [Bug] [Line 417]. "int and long values are written..." Is canonical (i. 
> > > e. smallest) encoding of numbers required?
> > > Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307
> > >
> > > I still think canonical representations should be required. The more 
> > > forms of encoding you allow the bigger is attack
> > > surface.
> > >
> > > Also, it would be desirable property for binary representations be equal 
> > > when data is equal. It would be good if you
> > > guarantee this property at least for some subset of schemas (and of 
> > > course, you should write explicitly for which
> > > schemas the property is guaranteed). Non-canonical representations break 
> > > this property
> > >
> > > * [Bug] [Line 446]. "bytes of UTF-8". As well as I understand UTF-8 is 
> > > sequence of encoded scalar values (don't confuse
> > > with code points). Unfortunately, not everybody knows this, and thus we 
> > > see WTF-8 (i. e. encoding similar to UTF-8, but
> > > with standalone surrogates) available in places where proper UTF-8 should 
> > > reside. So everywhere where the spec says
> > > "UTF-8" I propose to explicitly write that standalone surrogates are not 
> > > allowed and that readers should fail if they
> > > find them (I prefer to place this sentence to introduction of the spec)
> > >
> > > * [Bug] [Line 572]. "Currently for C/C++ implementations, the positions 
> > > are practically an int, but theoretically a
> > > long". Wat? So, other implementations use int (as per spec), but C++ uses 
> > > long, right? So, go fix C++ implementation to
> > > match spec and other implementations
> > >
> > > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a 
> > > JSON null". There is no reasons to special-case
> > > nulls. This is additional requirement, which adds complexity to 
> > > implementations without any reasons
> > >
> > > * [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." 
> > > It seems I found a real bug. :) Consider this
> > > schema:
> > >
> > > [{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]
> > >
> > > As well as I understand such schema fully allowed. Now consider this 
> > > encoded value: {"map":{"b":0}}. What is it? Map or
> > > record named "map"?
> > >
> > > * [Bug] [Line 677]. "data is ordered by ascending numeric value". What 
> > > about NaNs?
> > >
> > > * [Bug] [Line 682]. "compared lexicographically by Unicode code point". 
> > > Replace with scalar values. UTF-8 consists of
> > > scalar values
> > >
> > > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for 
> > > this file". I don't like this point. It
> > > implies that container files are usually not equal. Thus it is not 
> > > possible to compare them bitwise to determine
> > > equality. So, in my Avro implementation I write null bytes instead of 
> > > this marker (yes, this possibly means that my
> > > implementation is non-conforming)
> > >
> > > * [Opinion] [Line 717]. There is no any marker for end of container file. 
> > > Thus there is no way to determine whether all
> > > data was written
> > >
> > > * [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable 
> > > to string". Wat? How these values are promoted?
> > >
> > > * [Bug] [Line 1153]. What implementation should do (when it does schema 
> > > resolution)? It should first check that schemas
> > > match (and report any errors) and then read data? Or proceed straight to 
> > > reading data? This is important distinction.
> > > For example, what happens when we attempt to read file container without 
> > > data elements using schema resolution
> > > algorithm? (Are such container allowed, by the way?) In the first case 
> > > scheme check should be performed. In the second
> > > such reading should always be successful.
> > >
> > > If you think the first case is correct, then the section should describe 
> > > algorithm for determining matching of schemas
> > > separately from algorithm of actual reading data
> > >
> > > * [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int" 
> > > instead of {"type":"int"}>>, right?
> > >
> > > * [Bug] [Line 1331]. "replace any escaped characters". Any? What about 
> > > "a\"b"? It is impossible to replace :)
> > >
> > > ----
> > >
> > > Some notes about my task. I want to implement this:
> > > https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ . 
> > > I. e. I want to have shell utils in Linux,
> > > which exchange some structured data. Here is how I chose format for 
> > > representing that data.
> > >
> > > * I want format to be binary, not textual, this rules out JSON, XML, etc
> > > * I want format to be typed, this rules out CBOR etc
> > > * I want format to have support for proper sum types (similar to 
> > > Haskell's), this rules out Microsoft Bond. As well as
> > > I understand this also rules out using GVariants, proposed in above 
> > > mentioned article. And this rules out Protobuf:
> > > Protobuf has support for sum types (they are named OneOf), but this 
> > > OneOfs are always optional (speaking in Avro
> > > language: you always get ["null", "int", "bool"] instead of ["int", 
> > > "bool"])
> > > * I want format to have support for recursive types, this rules out Bare 
> > > ( baremessages.org )
> > >
> > > So, we have not so many formats left. Avro and possibly a few more. And I 
> > > chose Avro. And I really like it. Because:
> > >
> > > * It is very compact
> > > * It has very elegant way to support schema evolution (as opposed to 
> > > Protobuf, where fields are tagged, i. e. you trade
> > > space efficiency for schema evolution)
> > > * It has container format with schema attached
> > > * You don't need to write items count to container header (good for 
> > > streaming)
> > >
> > > So, Avro is simply *best* for my task. But then I discovered its problems 
> > > (listed above). How it is happened that such
> > > good format has so bad spec? How it is happened that *best* format for 
> > > this task happened to be so bad? What this says
> > > about our industry?
> > >
> > > ==
> > > Askar Safin
> > > http://safinaskar.com
> > > https://sr.ht/~safinaskar
> > > https://github.com/safinaskar

Re: Gigantic list of Avro spec issues

Reply via email to