I will admit the spec is likely weak around Unicode/UTF encoding (the serialization of strings to JSON isn't consistent across all language bindings), but I file tickets against implementations and write test cases rather than guess at the spec wording and make demands without knowing it's a real issue.
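As one concrete illustration of that inconsistency (Python only; other bindings differ in their own ways), even a single JSON library can emit two different byte sequences for the same non-BMP string:

```python
import json

# U+1F600 lies outside the Basic Multilingual Plane, so Python's default
# ASCII-safe mode escapes it as a UTF-16 surrogate pair.
emoji = "\U0001F600"

escaped = json.dumps(emoji)                  # '"\\ud83d\\ude00"'
raw = json.dumps(emoji, ensure_ascii=False)  # the raw UTF-8 character

# Both decode to the same string, but the serialized bytes differ,
# so byte-level comparison of JSON output across bindings is unreliable.
assert json.loads(escaped) == json.loads(raw) == emoji
assert escaped != raw
```

This is only a sketch of the failure mode; which escaping mode a given Avro binding uses is exactly the kind of thing a per-implementation ticket and test case would pin down.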
On Sat, Feb 12, 2022 at 6:55 PM Dan Schmitt <dan.schm...@gmail.com> wrote:
>
> Generally the "I'm going to lump all my complaints into
> one big bug" is a good way to get them ignored.
>
> I'll skip "the design is wrong and it should change because
> I don't like it" and cite "it's used everywhere with lots of
> implementations so you can't change it in an incompatible way".
>
> I'll skip obvious typos and suggest you catch more flies with
> honey than vinegar; you can work a git repository and
> make a pull request for those and they'd be fixed fast. (Or
> read 2 or 3 of the 8 implementations if the phrasing is confusing
> to you.)
>
> I'll make some suggestions on the technical details.
> So, specifically:
>
> * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker
> for this file". I don't like this point. It
> implies that container files are usually not equal, so it is not
> possible to compare them bitwise to determine
> equality. So, in my Avro implementation I write null bytes instead of
> this marker (yes, this possibly means that my
> implementation is non-conforming).
>
> * [Opinion] [Line 717]. There is no marker for the end of a container
> file, so there is no way to determine whether all
> data was written.
>
> If you use the sync marker, you don't need an end-of-container marker.
> (Flush the sync and container block map with new data after the new
> block is written; if you have the metadata and block list, you know that
> much is complete and written for you to read, and if you read the metadata
> and your sync byte marker is wrong, go re-read/continue reading.)
>
> * [Very theoretical bug, possibly even security-related] [Line 435].
> Since you have a test case that proves it doesn't crash systems, it's
> sort of not a bug, right? You could add the test case to the test suite.
>
> * [Bug] [Line 572]. "Currently for C/C++ implementations, the
> positions are practically an int, but theoretically a
> long". Wat? So, other implementations use int (as per spec), but C++
> uses long, right? So, go fix the C++ implementation to
> match the spec and the other implementations.
>
> This is not a bug, but an acknowledgement that the C/C++ offsets
> are internally implemented via pointer math to be efficient, and that if
> you try to read in enough data that a long offset makes sense,
> you will be sad/run out of memory. The internal implementation
> for C/C++ supports the minimum required by the specification.
>
> * [Bug] [Line 385]. "For example, int and long are always serialized
> the same way". What does this mean? You probably mean
> that the *same* int and long (i.e. an int and a long which are numerically
> identical) are serialized the same way.
>
> That rewrite is wrong. Your wording would allow serialization to be
> altered by value (e.g. it would be allowable to use big-endian storage
> for odd numbers and little-endian for even, as each same int and long
> would still be serialized the same way).
>
> * [Opinion] [Line 596]. "if its type is null, then it is encoded as a
> JSON null". There is no reason to special-case
> nulls. This is an additional requirement, which adds complexity to
> implementations without any reason.
>
> You are making assumptions about implementation encoding of null.
> A C developer would say writing 0x00 to the file that you will read back
> later is fine for null or false or 0.
>
> * [Bug] [Line 417]. - can't take action; the request breaks compatibility.
>
> * [Bug] [Line 292]. "The null namespace may not be used in a
> dot-separated sequence of names". You previously defined the
> null namespace as an empty string rather than as a *whole* namespace, i.e.
> the null namespace is the lack of a namespace (i.e. the lack of a
> whole dot-separated sequence). So there is no sense in speaking of
> using the null namespace in a dot-separated sequence. You
> probably mean that one should not use an empty string in a dot-separated sequence.
>
> That doesn't read as a bug at all to me; you have introduced a "whole
> namespace" and only then does it not make sense.
>
> a.b.c - a is the top-level namespace, b is a namespace in a, c is a
> namespace in b.
>
> Probably easier to read up on C++ namespaces, nesting, and the full
> specification to see how your introduction of a "whole namespace" where
> it doesn't exist is what is causing your confusion.
>
> I look forward to more bug reports and pull requests to iron out typos
> and standardize existing practice (you might want to read through
> other implementations to make sure your suggestions are useful to
> the community at large).
>
> On Sat, Feb 12, 2022 at 5:22 PM Askar Safin <safinas...@mail.ru> wrote:
> >
> > Hi. I'm writing my own Avro implementation in Rust for personal use. During
> > this work I found a lot of issues in the Avro
> > spec (the list follows).
> >
> > I am sending this mail not only to the user and dev mailing lists of Avro, but also
> > to the Apache community list, the Kafka list, and 3
> > semi-randomly chosen Materialize employees, because I want to draw
> > attention to these problems. I hope this wider
> > community helps Avro fix these issues and possibly will give the necessary
> > resources.
> >
> > As far as I understand, Avro is used in Kafka. And Kafka, according to
> > their site, is used in "80% of all Fortune 100
> > companies". So Avro is a critical piece of humanity's infrastructure. It
> > should be absolutely perfect (and so I list
> > even very small bugs). But it is not perfect.
> >
> > Some of the items in this list are (small and big) bugs, some are typos, and some
> > are my objections to the design. Some are
> > fixable while keeping compatibility, some are not. I don't want to spend my
> > time reporting them as separate bugs, but
> > you can try to convince me to do so.
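[To make the a.b.c nesting point above concrete: an Avro fullname puts the name after the last dot and the namespace before it, so the "null namespace" is simply the absence of any dots. A hypothetical helper, not spec text, showing how an implementation might split a fullname:]

```python
def split_fullname(fullname: str):
    """Split an Avro fullname into (namespace, name).

    Everything before the last dot is the namespace; a name with no
    dots has no namespace at all (the "null namespace" case).
    """
    if "." in fullname:
        namespace, name = fullname.rsplit(".", 1)
        return namespace, name
    return None, fullname

# "a.b.c": c is the name, nested in namespace b, which nests in a.
assert split_fullname("a.b.c") == ("a.b", "c")
assert split_fullname("c") == (None, "c")
```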
> >
> > I created this list simply by reading the spec from end to end (I skipped
> > the sections on RPC and logical types). And I
> > didn't even look at implementations!
> >
> > I write this in the hope of helping Avro.
> >
> > I think a big audit of the spec and its implementations should be done.
> >
> > All line numbers apply to spec.xml from tag release-1.11.0 (i.e.
> > https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml
> > ). As far as I understand, this tag
> > corresponds to the currently published version at
> > https://avro.apache.org/docs/current/spec.html .
> >
> > So, here we go.
> >
> > * [Opinion] [No line]. In Avro one has to define named records inside each
> > other, like so:
> >
> > { "type": "record", "name": "a", "fields":
> > [{"name":"b","type":{"type":"record","name":"c",...}}] }
> >
> > This is very unnatural. In popular programming languages one usually defines
> > named records next to each other, not one
> > inside the other. Such a representation is not handy to deal with programmatically. In
> > my implementation I have to convert this
> > representation to the usual form "root type + list of named types" right after
> > reading the JSON and convert it back just before
> > writing.
> >
> > * [Opinion] [No line]. In this list you will see a lot of questions about the Avro
> > schema (encoded as JSON). A good JSON schema
> > ( https://json-schema.org/ ) would resolve many of them.
> >
> > * [Seems to be a bug] [Line 49]. "derived type name" is a vague term. In fact,
> > throughout the whole spec the phrase "type name" is used
> > very vaguely. Sometimes it means strings like "record" and sometimes it
> > means the names of named types. I propose defining,
> > at the very beginning of the spec, terms for primitive types, terms for strings
> > like "record", and terms for names of defined
> > types. Here is one possible way to do this: call strings like "record" and
> > "fixed" "type kinds" and never call them
> > type names, thus reserving the term "type name" for named types only (and
> > possibly for primitive types).
> >
> > This issue has already caused problems: look, for example, at these problems
> > with {"type":"record","name":"record",...}:
> > https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .
> >
> > * [Opinion] [Line 58]. There is no primitive type for unsigned 64-bit
> > integers. Such a type is present in languages such
> > as C and Rust.
> >
> > * [Very theoretical bug, possibly even security-related] [Line 435]. "The
> > float is converted into a 32-bit integer
> > using a method equivalent to Java's floatToIntBits and then encoded in
> > little-endian format". If we click the provided
> > link, we will see that this Java function does NaN normalization. I think
> > NaN normalization is a good thing. But I think
> > it is quite possible that spec implementers overlooked this NaN normalization
> > requirement. So I propose: write explicitly,
> > directly in the Avro spec, that NaNs are normalized. Audit all Avro
> > implementations: check whether they actually implemented this
> > requirement. Create tests which will actually test this requirement.
> >
> > Also, I don't know whether the bit pattern provided in that Java doc
> > (0x7fc00000) is a quiet NaN or a signaling one. If it is
> > signaling, this is very bad.
> >
> > As far as I understand, if you configure your FPU in a particular way, then
> > merely copying a signaling NaN from one place
> > to another will abort your program. So, if your FPU is configured a certain
> > way, then feeding particular binary Avro data
> > to a program can crash it. I.e. this is a security problem. So a reader
> > should be careful to check whether the input data is a
> > signaling NaN *before* storing it in floating-point registers.
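[For what it's worth, 0x7fc00000 has the top mantissa bit set, which marks a *quiet* NaN under IEEE 754-2008 semantics on mainstream hardware. A minimal Python sketch of the bits-first discipline described above, with made-up function names; it inspects raw bits before ever building a float:]

```python
import math
import struct

# The pattern Java's Float.floatToIntBits returns for every NaN:
# exponent all ones, top mantissa bit set => a quiet NaN.
QUIET_NAN_BITS = 0x7FC00000

def float_to_int_bits(value: float) -> int:
    """Writer side: normalize every NaN, like Java's floatToIntBits."""
    if math.isnan(value):
        return QUIET_NAN_BITS
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits

def checked_read_bits(data: bytes) -> int:
    """Reader side: inspect the raw bits *before* materializing a float,
    so a signaling NaN never reaches a floating-point register."""
    (bits,) = struct.unpack("<I", data)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan and bits != QUIET_NAN_BITS:
        raise ValueError("rejecting non-normalized (possibly signaling) NaN")
    return bits

# All NaNs collapse to the one normalized pattern on write, and only
# that pattern is accepted on read; 0x7F800001 (a signaling-NaN bit
# pattern) is rejected without ever becoming a float.
assert float_to_int_bits(float("nan")) == QUIET_NAN_BITS
```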
> >
> > I checked whether manipulating a signaling NaN can actually crash a program
> > with default settings on Windows and Linux. And
> > it turned out that a program will not crash. Still, I think signaling NaNs
> > should be handled carefully.
> >
> > Write in the spec that writers should normalize NaNs, that readers should
> > reject non-normalized NaNs, and that readers
> > should be careful not to store an incoming floating-point number in a floating-point
> > variable before sanitizing it. Write that
> > this is a security issue.
> >
> > * [Opinion] [Line 68]. "unicode character sequence". As far as I
> > understand, a Unicode character sequence means a sequence
> > of Unicode scalar values. Note that a scalar value is not the same thing as a code
> > point. Unfortunately, some people don't know
> > this, so please write explicitly: "this is a sequence of scalar values, not
> > code points", to make sure implementations
> > will be correct.
> >
> > * [Bug] [Line 71]. "Primitive types have no specified attributes". This is
> > a lie. At line 1527 you specify a logical type
> > based on the primitive type int. Thus you specify a particular meaning for the
> > attribute "logicalType" on the primitive type "int".
> > Be careful with your wording. The spec should be rock-solid.
> >
> > * [Opinion] [Line 96]. "aliases: a JSON array of strings, providing
> > alternate names for this record (optional)". Is an
> > empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may say
> > this is nitpicking, but I don't think so.
> > Avro has an important place in our infrastructure, so everything is important.
> > Think carefully about whether an empty list (and
> > duplicates) is allowed everywhere in the spec where you see some kind of
> > list. I think empty arrays (and duplicates)
> > should be disallowed in this particular case, because the more things we
> > allow, the bigger the attack surface.
> >
> > * [Bug] [Line 98]. "fields: a JSON array, listing fields". How many fields are
> > allowed? Already reported by me at
> > https://issues.apache.org/jira/browse/AVRO-3279
> >
> > * [Bug] [Line 235]. "Unions". How many variants in a union are allowed? Already
> > reported by me at
> > https://issues.apache.org/jira/browse/AVRO-3280
> >
> > * [Bug] [Line 101]. "name: a JSON string providing the name of the field
> > (required), and". The word "and" is usually placed
> > immediately before the last item in a sequence. The text here looks as if the item
> > "doc" was last, but then the text was edited
> > carelessly. This is a very silly typographic issue which shows that the
> > authors are not careful about spec quality. Also,
> > the spec is not consistent about placing dots after items (this applies to the
> > whole spec). Sometimes I see nothing at the end
> > of an item, sometimes "." and sometimes ";".
> >
> > * [Opinion] [Line 106]. "default: A default value for..." What follows is
> > essentially a description of the JSON
> > representation of an Avro datum (except for unions). So, you managed to put a
> > very important part of your spec directly into
> > one paragraph in a second-level bullet point?!
> >
> > * [Opinion] [Line 112]. "Default values for union fields correspond to the
> > first schema in the union". This phrase is a
> > difference between the JSON encoding for Avro data and the JSON encoding for a
> > default field. And, of course, the presence of this
> > difference is a design bug.
> >
> > * [Opinion] [Line 113]. "Default values for bytes and fixed fields are JSON
> > strings, where Unicode code points 0-255
> > are mapped to unsigned 8-bit byte values 0-255". Wat? This is a very
> > unnatural encoding. You misuse JSON strings. They are
> > for strings; they are not for binary data. You should use an array of numbers
> > instead, i.e. encode bytes 0x0f 0x02 as
> > [15, 2]. Moreover, how will you encode null bytes? "\u0000", right? C
> > programs have difficulties with such strings.
> >
> > * [Bug] [Line 123]. Okay, so longs are encoded as JSON integers. But what
> > if a given long is not a JSON-safe integer? As
> > we know from https://datatracker.ietf.org/doc/html/rfc7159 , integers outside
> > the range [-(2**53)+1, (2**53)-1] are not
> > JSON-safe.
> >
> > * [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON,
> > despite seeming to be allowed by the Avro spec.
> > So, the JSON representation of Avro data is incomplete.
> >
> > * [Bug] [Line 128]. "enum". What is an enum? :) This term is not yet defined
> > by the spec. For unknown reasons you decided to
> > insert essentially the whole description of the JSON representation of Avro data
> > into one small paragraph even before the type
> > system is fully described. Please use terms only after their definition.
> >
> > * [Stupid bug] [Line 168]. "namespace". Namespace is not marked as optional
> > or required. Do you ever read your spec?
> >
> > * [Even more stupid bug] [Line 168]. "<em>namespace</em>". The word "namespace"
> > is marked up using <em>, not <code>, and we
> > can see this in the rendered version. This is a very silly typographic bug,
> > which is immediately obvious to anybody reading
> > this document, even to non-technical people.
> >
> > * [Bug] [Line 200]. "a single attribute". As we see in the provided example,
> > "default" is allowed, too. What is the meaning of
> > this "default" attribute? And how does its meaning differ from the meaning of the
> > "default" key in a field description? (Same for
> > maps.)
> >
> > * [Bug] [Line 238]. "declares a schema which may be either a null or
> > string". A lie. The schema is ["null", "string"].
> > The *value* may be a null or a string. Please check that you don't confuse types
> > (schemas) with values throughout your whole spec.
> >
> > * [Very opinionated opinion :)] [Line 235]. I don't like your unions at
> > all. I'm coming from languages like Haskell and
> > Rust, where true sum types are supported. They are similar to your unions,
> > but their alternatives are named.
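[The duplicate-type restriction exists because an Avro writer identifies the active union branch from the value's type alone, whereas named alternatives let the caller name the branch. A toy sketch with a deliberately simplified type mapping:]

```python
# Map a few Avro type names to Python types, purely for illustration.
AVRO_TO_PY = {"int": int, "string": str}

def pick_branch(union: list, value) -> int:
    """Return the index of the single union branch matching the value's type."""
    matches = [i for i, t in enumerate(union)
               if isinstance(value, AVRO_TO_PY[t])]
    if len(matches) != 1:
        raise ValueError("no unique branch for this value")
    return matches[0]

assert pick_branch(["int", "string"], 7) == 0
assert pick_branch(["int", "string"], "seven") == 1

# ["int", "int"] would make every int ambiguous - which is why Avro
# forbids duplicate unnamed types, while named alternatives (Rust enums,
# Haskell sum types) carry the branch name alongside the value.
```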
> > Alternatives are identified by their names, so there is no restriction on
> > duplicate types, and hence no need for the very
> > unnatural restriction "Unions may not contain more than one schema with the
> > same type, except for the named types
> > record, fixed and enum".
> >
> > * [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings,
> > providing alternate names for this enum". You
> > mean "fixed", right? So, you copy-pasted the section on enums? Did you ever read
> > your spec from end to end at least one time?
> >
> > * [Bug] [Line 265]. "size: an integer, specifying the number of bytes per
> > value". Is zero allowed?
> >
> > * [Bug] [Line 292]. "The null namespace may not be used in a dot-separated
> > sequence of names". You previously defined the null
> > namespace as an empty string rather than as a *whole* namespace, i.e. the null
> > namespace is the lack of a namespace (i.e. the lack of the
> > whole dot-separated sequence). So there is no sense in speaking of using the
> > null namespace in a dot-separated sequence. You
> > probably mean that one should not use an empty string in a dot-separated sequence.
> >
> > * [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished
> > by specifying an additional schema, the
> > results of which are described in Schema Resolution". The term "additional
> > schema" is vague here. I would say:
> > "Deserializing data into a newer schema is accomplished by using the
> > algorithm described in Schema Resolution".
> >
> > * [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to
> > read Avro data with a schema that does not..."
> > The whole paragraph is very vague. On first reading I thought that it was
> > about schema resolution. After several
> > attempts to understand it, I finally understood that the paragraph is about
> > attempts to read without the original writer's
> > schema available at all. I propose removing the whole paragraph or rewriting it
> > completely.
> >
> > * [Bug] [Line 385]. "For example, int and long are always serialized the
> > same way". What does this mean? You probably mean
> > that the *same* int and long (i.e. an int and a long which are numerically
> > identical) are serialized the same way.
> >
> > * [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague.
> > Do you mean no bytes at all? Or null bytes,
> > i.e. some undefined number of null bytes? (Of course, I understand that
> > you mean the first variant, but I still don't
> > like the phrase.)
> >
> > * [Bug] [Line 417]. "int and long values are written..." Is a canonical (i.e.
> > smallest) encoding of numbers required?
> > Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307
> >
> > I still think canonical representations should be required. The more forms
> > of encoding you allow, the bigger the attack
> > surface.
> >
> > Also, it would be a desirable property for binary representations to be equal
> > when the data is equal. It would be good if you
> > guaranteed this property at least for some subset of schemas (and of course
> > you should write explicitly for which
> > schemas the property is guaranteed). Non-canonical representations break
> > this property.
> >
> > * [Bug] [Line 446]. "bytes of UTF-8". As far as I understand, UTF-8 is a
> > sequence of encoded scalar values (don't confuse them
> > with code points). Unfortunately, not everybody knows this, and thus we see
> > WTF-8 (i.e. an encoding similar to UTF-8, but
> > with standalone surrogates) available in places where proper UTF-8 should
> > reside. So everywhere the spec says
> > "UTF-8" I propose explicitly writing that standalone surrogates are not
> > allowed and that readers should fail if they
> > find them (I would prefer to place this sentence in the introduction of the spec).
> >
> > * [Bug] [Line 572]. "Currently for C/C++ implementations, the positions are
> > practically an int, but theoretically a
> > long". Wat? So, other implementations use int (as per spec), but C++ uses
> > long, right? So, go fix the C++ implementation to
> > match the spec and the other implementations.
> >
> > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a JSON
> > null". There is no reason to special-case
> > nulls. This is an additional requirement, which adds complexity to
> > implementations without any reason.
> >
> > * [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." It
> > seems I found a real bug. :) Consider this
> > schema:
> >
> > [{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]
> >
> > As far as I understand, such a schema is fully allowed. Now consider this
> > encoded value: {"map":{"b":0}}. What is it? A map, or
> > a record named "map"?
> >
> > * [Bug] [Line 677]. "data is ordered by ascending numeric value". What
> > about NaNs?
> >
> > * [Bug] [Line 682]. "compared lexicographically by Unicode code point".
> > Replace with scalar values. UTF-8 consists of
> > scalar values.
> >
> > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for
> > this file". I don't like this point. It
> > implies that container files are usually not equal, so it is not possible
> > to compare them bitwise to determine
> > equality. So, in my Avro implementation I write null bytes instead of this
> > marker (yes, this possibly means that my
> > implementation is non-conforming).
> >
> > * [Opinion] [Line 717]. There is no marker for the end of a container file,
> > so there is no way to determine whether all
> > the data was written.
> >
> > * [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable to
> > string". Wat? How are these values promoted?
> >
> > * [Bug] [Line 1153]. What should an implementation do (when it does schema
> > resolution)? Should it first check that the schemas
> > match (and report any errors) and then read the data? Or proceed straight to
> > reading the data? This is an important distinction.
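[Going back to the record-named-"map" schema from the Line 598 bullet above: the JSON encoding keys a union value by its "type name" - the name for named types, the type keyword otherwise - and nothing prevents a record's name from colliding with a keyword. A small sketch of the collision; the helper name is hypothetical:]

```python
def union_branch_label(schema) -> str:
    """JSON-encoding label for a union branch: the name for named types,
    the type keyword for everything else."""
    if isinstance(schema, dict) and schema["type"] in ("record", "enum", "fixed"):
        return schema["name"]
    return schema["type"] if isinstance(schema, dict) else schema

union = [
    {"type": "record", "name": "map",
     "fields": [{"name": "b", "type": "int"}]},
    {"type": "map", "values": "int"},
]

labels = [union_branch_label(s) for s in union]
# Both branches get the label "map", so {"map": {"b": 0}} is ambiguous.
assert labels == ["map", "map"]
```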
> > For example, what happens when we attempt to read a file container without
> > data elements using the schema resolution
> > algorithm? (Are such containers allowed, by the way?) In the first case a
> > schema check should be performed. In the second,
> > such a read should always be successful.
> >
> > If you think the first case is correct, then the section should describe the
> > algorithm for determining whether schemas match
> > separately from the algorithm for actually reading data.
> >
> > * [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int"
> > instead of {"type":"int"}>>, right?
> >
> > * [Bug] [Line 1331]. "replace any escaped characters". Any? What about
> > "a\"b"? It is impossible to replace. :)
> >
> > ----
> >
> > Some notes about my task. I want to implement this:
> > https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ .
> > I.e. I want to have shell utils in Linux
> > which exchange some structured data. Here is how I chose the format for
> > representing that data:
> >
> > * I want the format to be binary, not textual; this rules out JSON, XML, etc.
> > * I want the format to be typed; this rules out CBOR etc.
> > * I want the format to have support for proper sum types (similar to
> > Haskell's); this rules out Microsoft Bond. As far as
> > I understand, this also rules out using GVariants, proposed in the above-mentioned
> > article. And this rules out Protobuf:
> > Protobuf has support for sum types (they are named OneOf), but these OneOfs
> > are always optional (speaking in Avro
> > language: you always get ["null", "int", "bool"] instead of ["int", "bool"]).
> > * I want the format to have support for recursive types; this rules out Bare (
> > baremessages.org ).
> >
> > So, we have not so many formats left: Avro and possibly a few more. And I
> > chose Avro. And I really like it. Because:
> >
> > * It is very compact.
> > * It has a very elegant way to support schema evolution (as opposed to
> > Protobuf, where fields are tagged, i.e. you trade
> > space efficiency for schema evolution).
> > * It has a container format with the schema attached.
> > * You don't need to write an item count in the container header (good for
> > streaming).
> >
> > So, Avro is simply the *best* format for my task. But then I discovered its problems
> > (listed above). How did it happen that such a
> > good format has so bad a spec? How did it happen that the *best* format for this
> > task turned out to be so bad? What does this say
> > about our industry?
> >
> > ==
> > Askar Safin
> > http://safinaskar.com
> > https://sr.ht/~safinaskar
> > https://github.com/safinaskar