Hello! Thanks, Dan, for your calm and measured response -- you've given some excellent advice on how someone can make a positive contribution to the project and the spec.
Askar, your approach in presenting your specification review should have been more constructive: it isn't useful to cross-post across lists and recipients indiscriminately, and please remain civil when pointing out (potential) errors. I understand that the Avro specification isn't perfect, but I think the community has always been open to making changes and providing clarifications. I'm glad that you've found some things to appreciate in the Avro project, and I hope you choose to contribute in making the improvements that you'd like to see. Ryan [1] https://www.apache.org/foundation/policies/conduct On Sun, Feb 13, 2022 at 6:05 PM Dan Schmitt <dan.schm...@gmail.com> wrote: > > I will admit the spec is likely weak around unicode/utf encoding, > (as the serialization of strings to json isn't consistent across all > language bindings) but I file the ticket against implementations > and write test cases rather than make guesses at the spec > wording and make demands without knowing it's a real issue. > > On Sat, Feb 12, 2022 at 6:55 PM Dan Schmitt <dan.schm...@gmail.com> wrote: > > > > Generally the "I'm going to lump all my complaints into > > one big bug" is a good way to get them ignored. > > > > I'll skip "the design is wrong and it should change because > > I don't like it" and cite "it's used everywhere with lots of > > implementations so you can't change it in an incompatible way". > > > > I'll skip obvious typos and suggest you catch more flies with > > honey than vinegar, and you can work a git repository and > > make a pull request for those and they'd be fixed fast. (Or > > read 2 or 3 of the 8 implementations if the phrasing is confusing > > to you.) > > > > I'll make some suggestions on the technical details > > So, specifically: > > > > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker > > for this file". I don't like this point. It > > implies that container files are usually not equal. Thus it is not > > possible to compare them bitwise to determine > > equality. So, in my Avro implementation I write null bytes instead of > > this marker (yes, this possibly means that my > > implementation is non-conforming) > > * [Opinion] [Line 717]. There is no any marker for end of container > > file. Thus there is no way to determine whether all > > data was written > > > > If you use the sync marker, you don't need an end of container marker. > > (Flush the sync and container block map with new data after the new > > block is written, if you have the metadata and block list you know that > > much is complete and written for you to read, if you read the metadata > > and your sync byte marker is wrong, go re-read/continue read.) > > > > * [Very theoretical bug, possible even security-related] [Line 435]. > > Since you have a test case that proves it doesn't crash systems, it's > > sort of not a bug right? You could at the test case to the test suite. > > > > * [Bug] [Line 572]. "Currently for C/C++ implementations, the > > positions are practically an int, but theoretically a > > long". Wat? So, other implementations use int (as per spec), but C++ > > uses long, right? So, go fix C++ implementation to > > match spec and other implementations > > > > This is not a bug, but an acknowledgement that the C/C++ offsets > > are internally implemented via pointer math to be efficient but if > > you try to read in enough data that a long offset makes sense, > > you will be sad/run out of memory. That the internal implementation > > for C/C++ supports the minimum required by the specification. > > > > * [Bug] [Line 385]. "For example, int and long are always serialized > > the same way". What this means? You probably mean > > that *same* int and long (i. e. int and long, which are numerically > > identical) serialized the same way. > > > > That rewrite is wrong. Your wording would allow serialization to be > > altered by value (e.g. it would be allowable to use big endian storage > > for odd numbers and little endian for even as each same int and long > > would be serialized the same way.) > > > > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a > > JSON null". There is no reasons to special-case > > nulls. This is additional requirement, which adds complexity to > > implementations without any reasons > > > > You are making assumptions about implementation encoding of null. > > A C developer would say writing 0x00 to the file that you will read back > > later is fine for null or false or 0. > > > > * [Bug] [Line 417]. - can't take action, the request breaks compatibility. > > > > * [Bug] [Line 292]. "The null namespace may not be used in a > > dot-separated sequence of names". You defined previously > > null namespace as a empty string instead of *whole* namespace. I. e. > > null namespace is lack of namespace (i. e. lack of > > whole dot-separated sequence). So there is no sense in speaking on > > using null namespace in dot-separated sequence. You > > probably mean that one should not use empty string in dot-separated sequence > > > > That doesn't read as a bug at all to me, you have introduce a "whole > > namespace" and only then does it not make sense > > > > a.b.c - a is the top level namespace, b is a namespace in a, c is a > > namespace in b. > > > > Probably easier to read up on C++ namespaces and nesting and full > > specification to see how your introduction of "whole namespace" where > > it doesn't exist is what is causing your confusion. > > > > I look forward to more bug reports and pull requests to iron out typos > > and standardize existing practice (you might want to read through > > other implementations to make sure your suggestions are useful to > > the community at large.) > > > > On Sat, Feb 12, 2022 at 5:22 PM Askar Safin <safinas...@mail.ru> wrote: > > > > > > Hi. I'm writing my own Avro implementation in Rust for personal use. > > > During this work I found a lot of issues in Avro > > > spec (the list follows). > > > > > > I send this mail not only to user and dev mailing lists of Avro, but also > > > to Apache community list, Kafka list and 3 > > > semi-randomly chosen Materialize employees. Because I want to draw > > > attention to this problems. I hope this wider > > > community helps Avro fix their issues and possible will give necessary > > > resources. > > > > > > As well as I understand Avro is used in Kafka. And Kafka, according to > > > their site, is used in "80% of all Fortune 100 > > > companies". So Avro is critical piece of infrastructure of humanity. It > > > should be absolutely perfect (and so I list > > > even very small bugs). But it is not perfect. > > > > > > Some of items in this list are (small and big) bugs, some are typos, some > > > are my objections to the design. Some are > > > fixable while keeping compatibility, some are not. I don't want to spend > > > my time to report them as separate bugs, but > > > you can try to convince me to do so. > > > > > > I created this list simply by reading the spec from end to end (I skipped > > > sections on RPC and logical types). And I > > > even didn't look at implementations! > > > > > > I write this is hope to help Avro. > > > > > > I think big audit of spec and its implementations should be done. > > > > > > All line numbers apply to spec.xml from tag release-1.11.0 (i. e. > > > https://github.com/apache/avro/blob/release-1.11.0/doc/src/content/xdocs/spec.xml > > > ). As well as I understand this tag > > > corresponds to currently published version at > > > https://avro.apache.org/docs/current/spec.html . > > > > > > So, here we go. > > > > > > * [Opinion] [No line]. In Avro one have to define named records inside > > > each other like so: > > > > > > { "type": "record", "name": "a", "fields": > > > [{"name":"b","type":{"type":"record","name":"c",...}}] } > > > > > > This is very unnatural. In popular programming languages one usually > > > define named record next to each other, not one > > > inside other. Such representation is not handy to deal programmatically. > > > In my implementation I have to convert this > > > representation to usual form "root type + list of named types" right > > > after reading JSON and convert back just before > > > writing. > > > > > > * [Opinion] [No line]. In this list you will see a lot of questions on > > > Avro schema (encoded as JSON). Good JSON schema > > > ( https://json-schema.org/ ) would resolve many of them > > > > > > * [Seems to be bug] [Line 49]. "derived type name" is vague term. In > > > fact, in whole spec phrase "type name" is used > > > very vaguely. Sometimes it means strings like "record" and sometimes it > > > means names of named types. I propose to define > > > in very beginning of the spec terms for primitive types, terms for > > > strings like "record" and terms for names of defined > > > types. Here is one possible way to do this: name strings like "record" > > > and "fixed" "type kinds" and never name them > > > type names, thus reserving term "type name" to named types only (and > > > possibly to primitive types). > > > > > > This issue already caused problems: look, for example, to this problems > > > with {"type":"record","name":"record",...}: > > > https://lists.apache.org/thread/0wmgyx6z69gy07lvj9ndko75752b8cn2 . > > > > > > * [Opinion] [Line 58]. There is no primitive type for unsigned 64 bit > > > integers. Such type is present in languages such > > > as C and Rust > > > > > > * [Very theoretical bug, possible even security-related] [Line 435]. "The > > > float is converted into a 32-bit integer > > > using a method equivalent to Java's floatToIntBits and then encoded in > > > little-endian format". If we click at provided > > > link, we will see that this Java function does NaN normalization. I think > > > NaN normalization is good thing. But I think > > > this quite possible spec implementers overlooked this NaN normalization > > > requirement. So I propose: write explicitly > > > directly in Avro spec that NaN are normalized. Audit all Avro > > > implementations: whether they actually implemented this > > > requirement. Create tests, which will actually test this requirement. > > > > > > Also I don't know whether bit pattern provided in that Java doc > > > (0x7fc00000) is quiet NaN or signaling. If it is > > > signaling, this is very bad. > > > > > > As well as I understand if you will configure your FPU particular way > > > than merely copying signaling NaN from one place > > > to another will abort your program. So, if your FPU is configured certain > > > way then feeding particular binary Avro data > > > to a program can crash it. I. e. this is security problem. So a reader > > > should be careful to check whether input data is > > > signaling NaN *before* storing it in floating point registers. > > > > > > I checked whether manipulating signaling NaN can actually crash a program > > > in default settings in Windows and Linux. And > > > it turned out that a program will not crash. Still I think signaling NaN > > > should be handled carefully. > > > > > > Write to spec that writers should normalize NaNs, that readers should > > > reject non-normalized NaNs and that readers > > > should be careful not to store incoming floating number to floating-point > > > variable before its sanitizing. Write that > > > this is security issue. > > > > > > * [Opinion] [Line 68]. "unicode character sequence". As well as I > > > understand Unicode character sequence means sequence > > > of Unicode scalar values. Note that scalar value is not same thing as > > > code point. Unfortunately, some people don't know > > > this, so please write explicitly: "this is sequence of scalar values, not > > > code points", to make sure implementations > > > will be correct > > > > > > * [Bug] [Line 71]. "Primitive types have no specified attributes". This > > > is lie. At line 1527 you specify logical type > > > based on primitive type int. Thus you specify particular meaning of > > > attribute "logicalType" for primitive type "int". > > > Be careful at your wording. The spec should be rock-solid > > > > > > * [Opinion] [Line 96]. "aliases: a JSON array of strings, providing > > > alternate names for this record (optional)". Is > > > empty array allowed? :) Are duplicate aliases allowed? :) Yes, you may > > > say this is nitpicking, but I don't think so. > > > Avro has important place in our infrastructure, so everything is > > > important. Think carefully whether empty list (and > > > duplicates) is allowed everywhere in the spec where you see some kind of > > > list. I think empty arrays (and duplicates) > > > should be disallowed in this particular case. Because the more things we > > > allow, the bigger is attack surface > > > > > > * [Bug] [Line 98]. "fields: a JSON array, listing fields". How many > > > fields allowed? Already reported by me at > > > https://issues.apache.org/jira/browse/AVRO-3279 > > > > > > * [Bug] [Line 235]. "Unions". How many variants in union allowed? Already > > > reported by me at > > > https://issues.apache.org/jira/browse/AVRO-3280 > > > > > > * [Bug] [Line 101]. "name: a JSON string providing the name of the field > > > (required), and". Word "and" usually placed > > > immediately before last item in sequence. The text here looks like item > > > "doc" was last, but then the text was edited > > > not carefully. This is very stupid typographic issue which shows that > > > authors are not careful about spec quality. Also, > > > the spec is not consistent on placing dots after items (this applies to > > > whole spec). Sometimes I see nothing in the end > > > of item, sometimes "." and sometimes ";" > > > > > > * [Opinion] [Line 106]. "default: A default value for..." What follows is > > > essentially description of JSON > > > representation of Avro datum (except for unions). So, you managed to put > > > very important part of your spec directly into > > > one paragraph into second level bullet point?! > > > > > > * [Opinion] [Line 112]. "Default values for union fields correspond to > > > the first schema in the union". This phrase is > > > difference between JSON encoding for Avro data and JSON encoding for > > > default field. And, of course, presence of this > > > difference is design bug > > > > > > * [Opinion] [Line 113]. "Default values for bytes and fixed fields are > > > JSON strings, where Unicode code points 0-255 > > > are mapped to unsigned 8-bit byte values 0-255". Wat? This is very > > > unnatural encoding. You misuse JSON string. They are > > > for strings, they are not for binary data. You should use array of > > > numbers instead. I. e. encode bytes 0x0f 0x02 as > > > [15, 2]. Moreover, how you will encode null bytes? "\u0000", right? C > > > programs have difficulties with such strings > > > > > > * [Bug] [Line 123]. Okey, so longs are encoded as JSON integers. But what > > > if given long is not a JSON-safe integer? As > > > we know from https://datatracker.ietf.org/doc/html/rfc7159 integers > > > outside of range [-(2**53)+1, (2**53)-1] are not > > > JSON-safe > > > > > > * [Bug] [Line 123]. Infinities and NaNs cannot be represented in JSON, > > > despite they seems to be allowed by Avro spec. > > > So, JSON representation of Avro data is incomplete > > > > > > * [Bug] [Line 128]. "enum". What is enum? :) This term is not yet defined > > > by spec. For unknown reasons you decided to > > > insert essentially whole description of JSON representation of Avro data > > > into one small paragraph even before type > > > system is fully described. Please use terms only after their definition > > > > > > * [Stupid bug] [Line 168]. "namespace". Namespace is not marked as > > > optional or required. Do you ever read your spec? > > > > > > * [Even more stupid bug] [Line 168]. "<em>namespace</em>". Word > > > "namespace" is marked using <em>, not <code>, and we > > > can see this in rendered version. This is very stupid typographic bug, > > > which is immediately obvious to anybody reading > > > this document, even to non-technical people > > > > > > * [Bug] [Line 200]. "a single attribute". As we see in provided example, > > > "default" is allowed, too. What is meaning of > > > this "default" attribute? And how its meaning differs from meaning of > > > "default" key in field description? (Same for > > > maps) > > > > > > * [Bug] [Line 238]. "declares a schema which may be either a null or > > > string". Lie. Schema is ["null", "string"]. > > > *Value* may be a null or string. Please check that you don't confuse > > > types (schemas) with value through whole your spec > > > > > > * [Very opinionated opinion :)] [Line 235]. I don't like your unions at > > > all. I'm coming from languages like Haskell and > > > Rust, where true sum types are supported. They are similar to your > > > unions, but their alternatives are named. > > > Alternatives are identified by their names, so there is no restriction on > > > duplicate types. So there is no need for very > > > unnatural restriction "Unions may not contain more than one schema with > > > the same type, except for the named types > > > record, fixed and enum" > > > > > > * [Absolutely stupid bug] [Line 261]. "aliases: a JSON array of strings, > > > providing alternate names for this enum". You > > > mean "fixed", right? So, you copy-pasted section on enums? Do you ever > > > read your spec from end to end at least one time? > > > > > > * [Bug] [Line 265]. "size: an integer, specifying the number of bytes per > > > value". Is zero allowed? > > > > > > * [Bug] [Line 292]. "The null namespace may not be used in a > > > dot-separated sequence of names". You defined previously > > > null namespace as a empty string instead of *whole* namespace. I. e. null > > > namespace is lack of namespace (i. e. lack of > > > whole dot-separated sequence). So there is no sense in speaking on using > > > null namespace in dot-separated sequence. You > > > probably mean that one should not use empty string in dot-separated > > > sequence > > > > > > * [Bug] [Line 374]. "Deserializing data into a newer schema is > > > accomplished by specifying an additional schema, the > > > results of which are described in Schema Resolution". Term "additional > > > schema" is vague here. I would say so: > > > "Deserializing data into a newer schema is accomplished by using an > > > algorithm described in Schema Resolution" > > > > > > * [Bug] [Line 380]. "Therefore, it is possible, though not advisable, to > > > read Avro data with a schema that does not..." > > > The whole paragraph is very vague. At first reading I thought that it is > > > about schema resolution. After several > > > attempts to understand it I finally understood that the paragraph is > > > about reading attempts without original writer > > > schema available at all. I propose removing whole paragraph or rewriting > > > it completely > > > > > > * [Bug] [Line 385]. "For example, int and long are always serialized the > > > same way". What this means? You probably mean > > > that *same* int and long (i. e. int and long, which are numerically > > > identical) serialized the same way. > > > > > > * [Bug] [Line 413]. "null is written as zero bytes". The phrase is vague. > > > Do you mean no bytes at all? Or null bytes, > > > i. e. some undefined number of null bytes? (Of course, I understand that > > > you mean the first variant, but I still don't > > > like the phrase) > > > > > > * [Bug] [Line 417]. "int and long values are written..." Is canonical (i. > > > e. smallest) encoding of numbers required? > > > Already reported by me at https://issues.apache.org/jira/browse/AVRO-3307 > > > > > > I still think canonical representations should be required. The more > > > forms of encoding you allow the bigger is attack > > > surface. > > > > > > Also, it would be desirable property for binary representations be equal > > > when data is equal. It would be good if you > > > guarantee this property at least for some subset of schemas (and of > > > course, you should write explicitly for which > > > schemas the property is guaranteed). Non-canonical representations break > > > this property > > > > > > * [Bug] [Line 446]. "bytes of UTF-8". As well as I understand UTF-8 is > > > sequence of encoded scalar values (don't confuse > > > with code points). Unfortunately, not everybody knows this, and thus we > > > see WTF-8 (i. e. encoding similar to UTF-8, but > > > with standalone surrogates) available in places where proper UTF-8 should > > > reside. So everywhere where the spec says > > > "UTF-8" I propose to explicitly write that standalone surrogates are not > > > allowed and that readers should fail if they > > > find them (I prefer to place this sentence to introduction of the spec) > > > > > > * [Bug] [Line 572]. "Currently for C/C++ implementations, the positions > > > are practically an int, but theoretically a > > > long". Wat? So, other implementations use int (as per spec), but C++ uses > > > long, right? So, go fix C++ implementation to > > > match spec and other implementations > > > > > > * [Opinion] [Line 596]. "if its type is null, then it is encoded as a > > > JSON null". There is no reasons to special-case > > > nulls. This is additional requirement, which adds complexity to > > > implementations without any reasons > > > > > > * [*Real* bug] [Line 598]. "otherwise it is encoded as a JSON object..." > > > It seems I found a real bug. :) Consider this > > > schema: > > > > > > [{"type":"record","name":"map","fields":[{"name":"b","type":"int"}]},{"type":"map","values":"int"}] > > > > > > As well as I understand such schema fully allowed. Now consider this > > > encoded value: {"map":{"b":0}}. What is it? Map or > > > record named "map"? > > > > > > * [Bug] [Line 677]. "data is ordered by ascending numeric value". What > > > about NaNs? > > > > > > * [Bug] [Line 682]. "compared lexicographically by Unicode code point". > > > Replace with scalar values. UTF-8 consists of > > > scalar values > > > > > > * [Opinion] [Line 737]. "The 16-byte, randomly-generated sync marker for > > > this file". I don't like this point. It > > > implies that container files are usually not equal. Thus it is not > > > possible to compare them bitwise to determine > > > equality. So, in my Avro implementation I write null bytes instead of > > > this marker (yes, this possibly means that my > > > implementation is non-conforming) > > > > > > * [Opinion] [Line 717]. There is no any marker for end of container file. > > > Thus there is no way to determine whether all > > > data was written > > > > > > * [Bug] [Line 1186]. "string is promotable to bytes. bytes is promotable > > > to string". Wat? How these values are promoted? > > > > > > * [Bug] [Line 1153]. What implementation should do (when it does schema > > > resolution)? It should first check that schemas > > > match (and report any errors) and then read data? Or proceed straight to > > > reading data? This is important distinction. > > > For example, what happens when we attempt to read file container without > > > data elements using schema resolution > > > algorithm? (Are such container allowed, by the way?) In the first case > > > scheme check should be performed. In the second > > > such reading should always be successful. > > > > > > If you think the first case is correct, then the section should describe > > > algorithm for determining matching of schemas > > > separately from algorithm of actual reading data > > > > > > * [Bug] [Line 1308]. <<int instead of {"type":"int"}>>. You mean <<"int" > > > instead of {"type":"int"}>>, right? > > > > > > * [Bug] [Line 1331]. "replace any escaped characters". Any? What about > > > "a\"b"? It is impossible to replace :) > > > > > > ---- > > > > > > Some notes about my task. I want to implement this: > > > https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shell-pipeline/ . > > > I. e. I want to have shell utils in Linux, > > > which exchange some structured data. Here is how I chose format for > > > representing that data. > > > > > > * I want format to be binary, not textual, this rules out JSON, XML, etc > > > * I want format to be typed, this rules out CBOR etc > > > * I want format to have support for proper sum types (similar to > > > Haskell's), this rules out Microsoft Bond. As well as > > > I understand this also rules out using GVariants, proposed in above > > > mentioned article. And this rules out Protobuf: > > > Protobuf has support for sum types (they are named OneOf), but this > > > OneOfs are always optional (speaking in Avro > > > language: you always get ["null", "int", "bool"] instead of ["int", > > > "bool"]) > > > * I want format to have support for recursive types, this rules out Bare > > > ( baremessages.org ) > > > > > > So, we have not so many formats left. Avro and possibly a few more. And I > > > chose Avro. And I really like it. Because: > > > > > > * It is very compact > > > * It has very elegant way to support schema evolution (as opposed to > > > Protobuf, where fields are tagged, i. e. you trade > > > space efficiency for schema evolution) > > > * It has container format with schema attached > > > * You don't need to write items count to container header (good for > > > streaming) > > > > > > So, Avro is simply *best* for my task. But then I discovered its problems > > > (listed above). How it is happened that such > > > good format has so bad spec? How it is happened that *best* format for > > > this task happened to be so bad? What this says > > > about our industry? > > > > > > == > > > Askar Safin > > > http://safinaskar.com > > > https://sr.ht/~safinaskar > > > https://github.com/safinaskar