I will admit the spec is likely weak around Unicode/UTF encoding (the serialization of strings to JSON isn't consistent across all language bindings), but I file tickets against implementations and write test cases rather than guess at the spec wording and make demands without knowing it's a real issue.
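As one concrete illustration of that inconsistency (Python only; other bindings differ in their own ways), even a single JSON library can emit two different byte sequences for the same non-BMP string:

```python
import json

# U+1F600 lies outside the Basic Multilingual Plane, so Python's default
# ASCII-safe mode escapes it as a UTF-16 surrogate pair.
emoji = "\U0001F600"

escaped = json.dumps(emoji)                  # '"\\ud83d\\ude00"'
raw = json.dumps(emoji, ensure_ascii=False)  # the raw UTF-8 character

# Both decode to the same string, but the serialized bytes differ,
# so byte-level comparison of JSON output across bindings is unreliable.
assert json.loads(escaped) == json.loads(raw) == emoji
assert escaped != raw
```

This is only a sketch of the failure mode; which escaping mode a given Avro binding uses is exactly the kind of thing a per-implementation ticket and test case would pin down.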
On Sat, Feb 12, 2022 at 6:55 PM Dan Schmitt <dan.schm...@gmail.com> wrote:
>
> Generally the "I'm going to lump all my complaints into
> one big bug" is a good way to get them ignored.
>
> I'll skip "the design is wrong and it should change because
> I don't like it" and cite "it's used everywhere with lots of
> implementations so you can't change it in an incompatible way".
>
> I'll skip obvious typos and suggest you catch more flies with
> honey than vinegar; you can work a git repository and
> make a pull request for those and they'd be fixed fast. (Or
> read 2 or 3 of the 8 implementations if the phrasing is confusing
> to you.)
>
> I'll make some suggestions on the technical details.
> So, specifically:
>
> * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker
> for this file". I don't like this point. It
> implies that container files are usually not equal, so it is not
> possible to compare them bitwise to determine
> equality. So, in my Avro implementation I write null bytes instead of
> this marker (yes, this possibly means that my
> implementation is non-conforming).
>
> * [Opinion] [Line 717]. There is no marker for the end of a container
> file, so there is no way to determine whether all
> data was written.
>
> If you use the sync marker, you don't need an end-of-container marker.
> (Flush the sync and container block map with new data after the new
> block is written; if you have the metadata and block list, you know that
> much is complete and written for you to read, and if you read the metadata
> and your sync byte marker is wrong, go re-read/continue reading.)
>
> * [Very theoretical bug, possibly even security-related] [Line 435].
> Since you have a test case that proves it doesn't crash systems, it's
> sort of not a bug, right? You could add the test case to the test suite.
>
> * [Bug] [Line 572]. "Currently for C/C++ implementations, the
> positions are practically an int, but theoretically a
> long". Wat? So, other implementations use int (as per spec), but C++
> uses long, right? So, go fix the C++ implementation to
> match the spec and the other implementations.
>
> This is not a bug, but an acknowledgement that the C/C++ offsets
> are internally implemented via pointer math to be efficient, and that if
> you try to read in enough data that a long offset makes sense,
> you will be sad/run out of memory. The internal implementation
> for C/C++ supports the minimum required by the specification.
>
> * [Bug] [Line 385]. "For example, int and long are always serialized
> the same way". What does this mean? You probably mean
> that the *same* int and long (i.e. an int and a long which are numerically
> identical) are serialized the same way.
>
> That rewrite is wrong. Your wording would allow serialization to be
> altered by value (e.g. it would be allowable to use big-endian storage
> for odd numbers and little-endian for even, as each same int and long
> would still be serialized the same way).
>
> * [Opinion] [Line 596]. "if its type is null, then it is encoded as a
> JSON null". There is no reason to special-case
> nulls. This is an additional requirement, which adds complexity to
> implementations without any reason.
>
> You are making assumptions about implementation encoding of null.
> A C developer would say writing 0x00 to the file that you will read back
> later is fine for null or false or 0.
>
> * [Bug] [Line 417]. - can't take action; the request breaks compatibility.
>
> * [Bug] [Line 292]. "The null namespace may not be used in a
> dot-separated sequence of names". You previously defined the
> null namespace as an empty string rather than as a *whole* namespace, i.e.
> the null namespace is the lack of a namespace (i.e. the lack of a
> whole dot-separated sequence). So there is no sense in speaking of
> using the null namespace in a dot-separated sequence. You
> probably mean that one should not use an empty string in a dot-separated sequence.
>
> That doesn't read as a bug at all to me; you have introduced a "whole
> namespace" and only then does it not make sense.
>
> a.b.c - a is the top-level namespace, b is a namespace in a, c is a
> namespace in b.
>
> Probably easier to read up on C++ namespaces, nesting, and the full
> specification to see how your introduction of a "whole namespace" where
> it doesn't exist is what is causing your confusion.
>
> I look forward to more bug reports and pull requests to iron out typos
> and standardize existing practice (you might want to read through
> other implementations to make sure your suggestions are useful to
> the community at large).
>
> On Sat, Feb 12, 2022 at 5:22 PM Askar Safin <safinas...@mail.ru> wrote:
> >
> > Hi. I'm writing my own Avro implementation in Rust for personal use. During
> > this work I found a lot of issues in the Avro
> > spec (the list follows).
> >
> > I am sending this mail not only to the user and dev mailing lists of Avro, but also
> > to the Apache community list, the Kafka list, and 3
> > semi-randomly chosen Materialize employees, because I want to draw
> > attention to these problems. I hope this wider
> > community helps Avro fix these issues and possibly will give the necessary
> > resources.
> >
> > As far as I understand, Avro is used in Kafka. And Kafka, according to
> > their site, is used in "80% of all Fortune 100
> > companies". So Avro is a critical piece of humanity's infrastructure. It
> > should be absolutely perfect (and so I list
> > even very small bugs). But it is not perfect.
> >
> > Some of the items in this list are (small and big) bugs, some are typos, and some
> > are my objections to the design. Some are
> > fixable while keeping compatibility, some are not. I don't want to spend my
> > time reporting them as separate bugs, but
> > you can try to convince me to do so.
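[To make the a.b.c nesting point above concrete: an Avro fullname puts the name after the last dot and the namespace before it, so the "null namespace" is simply the absence of any dots. A hypothetical helper, not spec text, showing how an implementation might split a fullname:]

```python
def split_fullname(fullname: str):
    """Split an Avro fullname into (namespace, name).

    Everything before the last dot is the namespace; a name with no
    dots has no namespace at all (the "null namespace" case).
    """
    if "." in fullname:
        namespace, name = fullname.rsplit(".", 1)
        return namespace, name
    return None, fullname

# "a.b.c": c is the name, nested in namespace b, which nests in a.
assert split_fullname("a.b.c") == ("a.b", "c")
assert split_fullname("c") == (None, "c")
```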
> >
> > I created this list simply by reading the spec from end to end (I skipped
> > the sections on RPC and logical types). And I
> > didn't even look at implementations!
> >
> > I write this in the hope of helping Avro.
> >
> > I think a big audit of the spec and its implementations should be done.
> >
> > All line numbers apply to spec.xml from tag release-1.11.0 (i.e.
> > https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml
> > ). As far as I understand, this tag
> > corresponds to the currently published version at
> > https://avro.apache.org/docs/current/spec.html .
> >
> > So, here we go.
> >
> > * [Opinion] [No line]. In Avro one has to define named records inside each
> > other, like so:
> >
> > { "type": "record", "name": "a", "fields":
> > [{"name":"b","type":{"type":"record","name":"c",...}}] }
> >
> > This is very unnatural. In popular programming languages one usually defines
> > named records next to each other, not one
> > inside the other. Such a representation is not handy to deal with programmatically. In
> > my implementation I have to convert this
> > representation to the usual form "root type + list of named types" right after
> > reading the JSON and convert it back just before
> > writing.
> >
> > * [Opinion] [No line]. In this list you will see a lot of questions about the Avro
> > schema (encoded as JSON). A good JSON schema
> > ( https://json-schema.org/ ) would resolve many of them.
> >
> > * [Seems to be a bug] [Line 49]. "derived type name" is a vague term. In fact,
> > throughout the whole spec the phrase "type name" is used
> > very vaguely. Sometimes it means strings like "record" and sometimes it
> > means the names of named types. I propose defining,
> > at the very beginning of the spec, terms for primitive types, terms for strings
> > like "record", and terms for names of defined
> > types. Here is one possible way to do this: call strings like "record" and
> > "fixed" "type kinds" and never call them
> > type names, thus reserving the term "type name" for named types only (and
> > possibly for primitive types).
> >
> > This issue has already caused problems: look, for example, at these problems
> > with {"type":"record","name":"record",...}:
> > https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 .
> >
> > * [Opinion] [Line 58]. There is no primitive type for unsigned 64-bit
> > integers. Such a type is present in languages such
> > as C and Rust.
> >
> > * [Very theoretical bug, possibly even security-related] [Line 435]. "The
> > float is converted into a 32-bit integer
> > using a method equivalent to Java's floatToIntBits and then encoded in
> > little-endian format". If we click the provided
> > link, we will see that this Java function does NaN normalization. I think
> > NaN normalization is a good thing. But I think
> > it is quite possible that spec implementers overlooked this NaN normalization
> > requirement. So I propose: write explicitly,
> > directly in the Avro spec, that NaNs are normalized. Audit all Avro
> > implementations: check whether they actually implemented this
> > requirement. Create tests which will actually test this requirement.
> >
> > Also, I don't know whether the bit pattern provided in that Java doc
> > (0x7fc00000) is a quiet NaN or a signaling one. If it is
> > signaling, this is very bad.
> >
> > As far as I understand, if you configure your FPU in a particular way, then
> > merely copying a signaling NaN from one place
> > to another will abort your program. So, if your FPU is configured a certain
> > way, then feeding particular binary Avro data
> > to a program can crash it. I.e. this is a security problem. So a reader
> > should be careful to check whether the input data is a
> > signaling NaN *before* storing it in floating-point registers.
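[For what it's worth, 0x7fc00000 has the top mantissa bit set, which marks a *quiet* NaN under IEEE 754-2008 semantics on mainstream hardware. A minimal Python sketch of the bits-first discipline described above, with made-up function names; it inspects raw bits before ever building a float:]

```python
import math
import struct

# The pattern Java's Float.floatToIntBits returns for every NaN:
# exponent all ones, top mantissa bit set => a quiet NaN.
QUIET_NAN_BITS = 0x7FC00000

def float_to_int_bits(value: float) -> int:
    """Writer side: normalize every NaN, like Java's floatToIntBits."""
    if math.isnan(value):
        return QUIET_NAN_BITS
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits

def checked_read_bits(data: bytes) -> int:
    """Reader side: inspect the raw bits *before* materializing a float,
    so a signaling NaN never reaches a floating-point register."""
    (bits,) = struct.unpack("<I", data)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan and bits != QUIET_NAN_BITS:
        raise ValueError("rejecting non-normalized (possibly signaling) NaN")
    return bits

# All NaNs collapse to the one normalized pattern on write, and only
# that pattern is accepted on read; 0x7F800001 (a signaling-NaN bit
# pattern) is rejected without ever becoming a float.
assert float_to_int_bits(float("nan")) == QUIET_NAN_BITS
```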
> >
> > I checked whether manipulating a signaling NaN can actually crash a program
> > with default settings on Windows and Linux. And
> > it turned out that a program will not crash. Still, I think signaling NaNs
> > should be handled carefully.
> >
> > Write in the spec that writers should normalize NaNs, that readers should
> > reject non-normalized NaNs, and that readers
> > should be careful not to store an incoming floating-point number in a floating-point
> > variable before sanitizing it. Write that
> > this is a security issue.
> >
> > * [Opinion] [Line 68]. "unicode character sequence". As far as I
> > understand, a Unicode character sequence means a sequence
> > of Unicode scalar values. Note that a scalar value is not the same thing as a code
> > point. Unfortunately, some people don't know
> > this, so please write explicitly: "this is a sequence of scalar values, not
> > code points", to make sure implementations
> > will be correct.
> >
> > * [Bug] [Line 71]. "Primitive types have no specified attributes". This is
> > a lie. At line 1527 you specify a logical type
> > based on the primitive type int. Thus you specify a particular meaning for the
> > attribute "logicalType" on the primitive type "int".
> > Be careful with your wording. The spec should be rock-solid.
> >
> > * [Opinion] [Line 96]. "aliases: a JSON array of strings, providing
> > alternate names for this record (optional)". Is an
> > empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may say
> > this is nitpicking, but I don't think so.
> > Avro has an important place in our infrastructure, so everything is important.
> > Think carefully about whether an empty list (and
> > duplicates) is allowed everywhere in the spec where you see some kind of
> > list. I think empty arrays (and duplicates)
> > should be disallowed in this particular case, because the more things we
> > allow, the bigger the attack surface.
> >
> > * [Bug] [Line 98]. "fields: a JSON array, listing fields". How many fields are
> > allowed? Already reported by me at
> > https://issues.apache.org/jira/browse/AVRO-3279
> >
> > * [Bug] [Line 235]. "Unions". How many variants in a union are allowed? Already
> > reported by me at
> > https://issues.apache.org/jira/browse/AVRO-3280
> >
> > * [Bug] [Line 101]. "name: a JSON string providing the name of the field
> > (required), and". The word "and" is usually placed
> > immediately before the last item in a sequence. The text here looks as if the item
> > "doc" was last, but then the text was edited
> > carelessly. This is a very silly typographic issue which shows that the
> > authors are not careful about spec quality. Also,
> > the spec is not consistent about placing dots after items (this applies to the
> > whole spec). Sometimes I see nothing at the end
> > of an item, sometimes "." and sometimes ";".
> >
> > * [Opinion] [Line 106]. "default: A default value for..." What follows is
> > essentially a description of the JSON
> > representation of an Avro datum (except for unions). So, you managed to put a
> > very important part of your spec directly into
> > one paragraph in a second-level bullet point?!
> >
> > * [Opinion] [Line 112]. "Default values for union fields correspond to the
> > first schema in the union". This phrase is a
> > difference between the JSON encoding for Avro data and the JSON encoding for a
> > default field. And, of course, the presence of this
> > difference is a design bug.
> >
> > * [Opinion] [Line 113]. "Default values for bytes and fixed fields are JSON
> > strings, where Unicode code points 0-255
> > are mapped to unsigned 8-bit byte values 0-255". Wat? This is a very
> > unnatural encoding. You misuse JSON strings. They are
> > for strings; they are not for binary data. You should use an array of numbers
> > instead, i.e. encode bytes 0x0f 0x02 as
> > [15, 2]. Moreover, how will you encode null bytes? "\u0000", right? C
> > programs have difficulties with such strings.
> >
> > * [Bug] [Line 123]. Okay, so longs are encoded as JSON integers. But what
> > if a given long is not a JSON-safe integer? As
> > we know from https://datatracker.ietf.org/doc/html/rfc7159 , integers outside
> > the range [-(2**53)+1, (2**53)-1] are not
> > JSON-safe.
> >
> > * [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON,
> > despite seeming to be allowed by the Avro spec.
> > So, the JSON representation of Avro data is incomplete.
> >
> > * [Bug] [Line 128]. "enum". What is an enum? :) This term is not yet defined
> > by the spec. For unknown reasons you decided to
> > insert essentially the whole description of the JSON representation of Avro data
> > into one small paragraph even before the type
> > system is fully described. Please use terms only after their definition.
> >
> > * [Stupid bug] [Line 168]. "namespace". Namespace is not marked as optional
> > or required. Do you ever read your spec?
> >
> > * [Even more stupid bug] [Line 168]. "<em>namespace</em>". The word "namespace"
> > is marked up using <em>, not <code>, and we
> > can see this in the rendered version. This is a very silly typographic bug,
> > which is immediately obvious to anybody reading
> > this document, even to non-technical people.
> >
> > * [Bug] [Line 200]. "a single attribute". As we see in the provided example,
> > "default" is allowed, too. What is the meaning of
> > this "default" attribute? And how does its meaning differ from the meaning of the
> > "default" key in a field description? (Same for
> > maps.)
> >
> > * [Bug] [Line 238]. "declares a schema which may be either a null or
> > string". A lie. The schema is ["null", "string"].
> > The *value* may be a null or a string. Please check that you don't confuse types
> > (schemas) with values throughout your whole spec.
> >
> > * [Very opinionated opinion :)] [Line 235]. I don't like your unions at
> > all. I'm coming from languages like Haskell and
> > Rust, where true sum types are supported. They are similar to your unions,
> > but their alternatives are named.
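[The duplicate-type restriction exists because an Avro writer identifies the active union branch from the value's type alone, whereas named alternatives let the caller name the branch. A toy sketch with a deliberately simplified type mapping:]

```python
# Map a few Avro type names to Python types, purely for illustration.
AVRO_TO_PY = {"int": int, "string": str}

def pick_branch(union: list, value) -> int:
    """Return the index of the single union branch matching the value's type."""
    matches = [i for i, t in enumerate(union)
               if isinstance(value, AVRO_TO_PY[t])]
    if len(matches) != 1:
        raise ValueError("no unique branch for this value")
    return matches[0]

assert pick_branch(["int", "string"], 7) == 0
assert pick_branch(["int", "string"], "seven") == 1

# ["int", "int"] would make every int ambiguous - which is why Avro
# forbids duplicate unnamed types, while named alternatives (Rust enums,
# Haskell sum types) carry the branch name alongside the value.
```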
> > Alternatives are identified by their names, so there is no restriction on
> > duplicate types, and hence no need for the very
> > unnatural restriction "Unions may not contain more than one schema with the
> > same type, except for the named types
> > record, fixed and enum".
> >
> > * [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings,
> > providing alternate names for this enum". You
> > mean "fixed", right? So, you copy-pasted the section on enums? Did you ever read
> > your spec from end to end at least one time?
> >
> > * [Bug] [Line 265]. "size: an integer, specifying the number of bytes per
> > value". Is zero allowed?
> >
> > * [Bug] [Line 292]. "The null namespace may not be used in a dot-separated
> > sequence of names". You previously defined the null
> > namespace as an empty string rather than as a *whole* namespace, i.e. the null
> > namespace is the lack of a namespace (i.e. the lack of the
> > whole dot-separated sequence). So there is no sense in speaking of using the
> > null namespace in a dot-separated sequence. You
> > probably mean that one should not use an empty string in a dot-separated sequence.
> >
> > * [Bug] [Line 374]. "Deserializing data into a newer schema is accomplished
> > by specifying an additional schema, the
> > results of which are described in Schema Resolution". The term "additional
> > schema" is vague here. I would say:
> > "Deserializing data into a newer schema is accomplished by using the
> > algorithm described in Schema Resolution".
> >
> > * [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to
> > read Avro data with a schema that does not..."
> > The whole paragraph is very vague. On first reading I thought that it was
> > about schema resolution. After several
> > attempts to understand it, I finally understood that the paragraph is about
> > attempts to read without the original writer's
> > schema available at all. I propose removing the whole paragraph or rewriting it
> > completely.
> >
> > * [Bug] [Line 385]. "For example, int and long are always serialized the
> > same way". What does this mean? You probably mean
> > that the *same* int and long (i.e. an int and a long which are numerically
> > identical) are serialized the same way.
> >
> > * [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague.
> > Do you mean no bytes at all? Or null bytes,
> > i.e. some undefined number of null bytes? (Of course, I understand that
> > you mean the first variant, but I still don't
> > like the phrase.)
> >
> > * [Bug] [Line 417]. "int and long values are written..." Is a canonical (i.e.
> > smallest) encoding of numbers required?
> > Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307
> >
> > I still think canonical representations should be required. The more forms
> > of encoding you allow, the bigger the attack
> > surface.
> >
> > Also, it would be a desirable property for binary representations to be equal
> > when the data is equal. It would be good if you
> > guaranteed this property at least for some subset of schemas (and of course
> > you should write explicitly for which
> > schemas the property is guaranteed). Non-canonical representations break
> > this property.
> >
> > * [Bug] [Line 446]. "bytes of UTF-8". As far as I understand, UTF-8 is a
> > sequence of encoded scalar values (don't confuse them
> > with code points). Unfortunately, not everybody knows this, and thus we see
> > WTF-8 (i.e. an encoding similar to UTF-8, but
> > with standalone surrogates) available in places where proper UTF-8 should
> > reside. So everywhere the spec says
> > "UTF-8" I propose explicitly writing that standalone surrogates are not
> > allowed and that readers should fail if they
> > find them (I would prefer to place this sentence in the introduction of the spec).
> >
> > * [Bug] [Line 572]. "Currently for C/C++ implementations, the positions are
> > practically an int, but theoretically a
> > long". Wat? So, other implementations use int (as per spec), but C++ uses
> > long, right? So, go fix the C++ implementation to
> > match the spec and the other implementations.
> >
> > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a JSON
> > null". There is no reason to special-case
> > nulls. This is an additional requirement, which adds complexity to
> > implementations without any reason.
> >
> > * [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." It
> > seems I found a real bug. :) Consider this
> > schema:
> >
> > [{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}]
> >
> > As far as I understand, such a schema is fully allowed. Now consider this
> > encoded value: {"map":{"b":0}}. What is it? A map, or
> > a record named "map"?
> >
> > * [Bug] [Line 677]. "data is ordered by ascending numeric value". What
> > about NaNs?
> >
> > * [Bug] [Line 682]. "compared lexicographically by Unicode code point".
> > Replace with scalar values. UTF-8 consists of
> > scalar values.
> >
> > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for
> > this file". I don't like this point. It
> > implies that container files are usually not equal, so it is not possible
> > to compare them bitwise to determine
> > equality. So, in my Avro implementation I write null bytes instead of this
> > marker (yes, this possibly means that my
> > implementation is non-conforming).
> >
> > * [Opinion] [Line 717]. There is no marker for the end of a container file,
> > so there is no way to determine whether all
> > the data was written.
> >
> > * [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable to
> > string". Wat? How are these values promoted?
> >
> > * [Bug] [Line 1153]. What should an implementation do (when it does schema
> > resolution)? Should it first check that the schemas
> > match (and report any errors) and then read the data? Or proceed straight to
> > reading the data? This is an important distinction.
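[Going back to the record-named-"map" schema from the Line 598 bullet above: the JSON encoding keys a union value by its "type name" - the name for named types, the type keyword otherwise - and nothing prevents a record's name from colliding with a keyword. A small sketch of the collision; the helper name is hypothetical:]

```python
def union_branch_label(schema) -> str:
    """JSON-encoding label for a union branch: the name for named types,
    the type keyword for everything else."""
    if isinstance(schema, dict) and schema["type"] in ("record", "enum", "fixed"):
        return schema["name"]
    return schema["type"] if isinstance(schema, dict) else schema

union = [
    {"type": "record", "name": "map",
     "fields": [{"name": "b", "type": "int"}]},
    {"type": "map", "values": "int"},
]

labels = [union_branch_label(s) for s in union]
# Both branches get the label "map", so {"map": {"b": 0}} is ambiguous.
assert labels == ["map", "map"]
```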
> > For example, what happens when we attempt to read a file container without
> > data elements using the schema resolution
> > algorithm? (Are such containers allowed, by the way?) In the first case a
> > schema check should be performed. In the second,
> > such a read should always be successful.
> >
> > If you think the first case is correct, then the section should describe the
> > algorithm for determining whether schemas match
> > separately from the algorithm for actually reading data.
> >
> > * [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int"
> > instead of {"type":"int"}>>, right?
> >
> > * [Bug] [Line 1331]. "replace any escaped characters". Any? What about
> > "a\"b"? It is impossible to replace. :)
> >
> > ----
> >
> > Some notes about my task. I want to implement this:
> > https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ .
> > I.e. I want to have shell utils in Linux
> > which exchange some structured data. Here is how I chose the format for
> > representing that data:
> >
> > * I want the format to be binary, not textual; this rules out JSON, XML, etc.
> > * I want the format to be typed; this rules out CBOR etc.
> > * I want the format to have support for proper sum types (similar to
> > Haskell's); this rules out Microsoft Bond. As far as
> > I understand, this also rules out using GVariants, proposed in the above-mentioned
> > article. And this rules out Protobuf:
> > Protobuf has support for sum types (they are named OneOf), but these OneOfs
> > are always optional (speaking in Avro
> > language: you always get ["null", "int", "bool"] instead of ["int", "bool"]).
> > * I want the format to have support for recursive types; this rules out Bare (
> > baremessages.org ).
> >
> > So, we have not so many formats left: Avro and possibly a few more. And I
> > chose Avro. And I really like it. Because:
> >
> > * It is very compact.
> > * It has a very elegant way to support schema evolution (as opposed to
> > Protobuf, where fields are tagged, i.e. you trade
> > space efficiency for schema evolution).
> > * It has a container format with the schema attached.
> > * You don't need to write an item count in the container header (good for
> > streaming).
> >
> > So, Avro is simply the *best* format for my task. But then I discovered its problems
> > (listed above). How did it happen that such a
> > good format has so bad a spec? How did it happen that the *best* format for this
> > task turned out to be so bad? What does this say
> > about our industry?
> >
> > ==
> > Askar Safin
> > http://safinaskar.com
> > https://sr.ht/~safinaskar
> > https://github.com/safinaskar